GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Mallet LDA improvements/changes #83

Open johann-petrak opened 6 years ago

johann-petrak commented 6 years ago

A bunch of improvements, changes and checks to do, collected into this single issue:

  1. make sure we save the right data for later topic inference:
    • model file: apparently just for resuming training
    • inference file: for application on new documents
    • Gibbs state: why/when exactly do we need this?
  2. make sure we use hyperparameter optimization
  3. check what takes so long: is it calculating the diagnostics? If yes, make saving them optional
  4. find per-topic word probabilities without running the diagnostics
  5. save separate files after training:
    • words and their weights per topic (ntopics*nwords lines with topicnr, word, weight)
    • topic importance (ntopics lines with topicnr, topic weight)
    • if we run inference: documents and their topic weights (ndocuments*ntopics lines with docname, topicnr, weight), optional?
    • the full diagnostics file, optionally
  6. save separate files after application:
    • documents and their topic weights (as for training), optional?
  7. a parameter that specifies a feature-name prefix for storing global info in every document's document features (not on the document annotation!)
  8. provide a Groovy script for deriving k-best topic index lists from the topic distribution, based on:
    • which topics have the highest probability
    • whether the probability exceeds some threshold
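The k-best selection described in item 8 can be sketched in plain Java (the issue asks for a Groovy script; the logic is the same). All class and method names here are illustrative, not part of the plugin's actual API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the k-best topic selection logic from item 8.
public class KBestTopics {

    /**
     * Return the indices of at most k topics, ordered by descending
     * probability, keeping only topics whose probability exceeds threshold.
     */
    public static List<Integer> kBest(double[] topicDist, int k, double threshold) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < topicDist.length; i++) {
            if (topicDist[i] > threshold) {
                indices.add(i);
            }
        }
        // Sort the surviving topic indices by probability, highest first
        indices.sort(Comparator.comparingDouble((Integer i) -> topicDist[i]).reversed());
        return indices.subList(0, Math.min(k, indices.size()));
    }

    public static void main(String[] args) {
        double[] dist = {0.05, 0.40, 0.10, 0.30, 0.15};
        System.out.println(kBest(dist, 3, 0.08)); // prints [1, 3, 4]
    }
}
```

Combining both criteria (top-k and threshold) in one pass keeps the script simple; either criterion alone falls out by passing `k = topicDist.length` or `threshold = 0.0`.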
johann-petrak commented 6 years ago

Ad 1.:

johann-petrak commented 6 years ago
  2. make sure we use hyperparameter optimization:
    • this should always happen anyway; optimizeInterval is set to 50 in ParallelTopicModel
    • however, when testing with the Mallet command line tool train-topics: if --optimize-interval 50 is specified, we get interesting topic weights, while omitting that parameter gives equal weights and looks identical to --optimize-interval 0
johann-petrak commented 6 years ago
  3. check what takes so long:
    • yes, it is calculating the topic model diagnostics; make calculating and storing the file optional, default off
johann-petrak commented 6 years ago
  1. Has been implemented
johann-petrak commented 6 years ago

Item 5 has been implemented.

johann-petrak commented 6 years ago

Still missing: