datalab-dev / quintessence_analysis

All the scripts we use for analysis

Save all necessary data when running topic model #18

Closed avkoehl closed 3 years ago

avkoehl commented 3 years ago

It seems gensim's LDA Mallet wrapper writes its model files (such as the corpus) to temp files in /var/. This is a problem because we need that data to compute topic proportions and, potentially, document-term topic assignments.

To explore: may want to use gensim's LDA model instead of Mallet.

Short term though, will need to save the corpus object, dictionary object, and filenames if necessary.

Change the topicmodel init to take an output directory instead of the model_path argument. The output directory will be where the corpus, dictionary, and filenames are saved, along with the model object. Modify load_model to load those as well, and add corpus, dictionary, and filenames as properties of the class.
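
A rough sketch of what that could look like. The class name `TopicModel`, the `save` method, and the file names are placeholders I made up for illustration, not the current code. It uses pickle and JSON to keep the example dependency-free; gensim's own `Dictionary.save` and `MmCorpus.serialize` would probably be the better choice for the real thing.

```python
import json
import pickle
from pathlib import Path


class TopicModel:
    """Hypothetical sketch: everything needed after training lives in one directory."""

    def __init__(self, output_dir):
        # output_dir replaces the old model_path argument
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.model = None       # trained model object
        self.corpus = None      # bag-of-words corpus, e.g. list of [(token_id, count), ...]
        self.dictionary = None  # token id -> token mapping
        self.filenames = None   # one filename per document, in corpus order

    def _path(self, name):
        return self.output_dir / name

    def save(self):
        # Binary artifacts are pickled; filenames are plain JSON so they stay readable
        for name in ("model", "corpus", "dictionary"):
            with open(self._path(name + ".pkl"), "wb") as f:
                pickle.dump(getattr(self, name), f)
        with open(self._path("filenames.json"), "w") as f:
            json.dump(self.filenames, f)

    @classmethod
    def load_model(cls, output_dir):
        # Restore the model together with the corpus, dictionary, and filenames
        tm = cls(output_dir)
        for name in ("model", "corpus", "dictionary"):
            with open(tm._path(name + ".pkl"), "rb") as f:
                setattr(tm, name, pickle.load(f))
        with open(tm._path("filenames.json")) as f:
            tm.filenames = json.load(f)
        return tm
```

Note that pickling the model object alone would not capture the Mallet temp files, which is why the corpus, dictionary, and filenames are written out separately here.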

sampizelo commented 3 years ago

not sure if gensim's various wrappers are designed similarly, but Chandni was able to change this temp file save location for my DIM wrapper with prefix='/dsl/eebo/dim/' when calling the model function.

avkoehl commented 3 years ago

great call. prefix is indeed a param to the ldamallet wrapper and defines where the temp files are saved!
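
For reference, a minimal sketch of how that could be wired up. The helper names `mallet_prefix` and `train_mallet`, the `"mallet_"` file prefix, and the default `num_topics` are my own placeholders. It assumes gensim 3.x, where the wrapper is importable as `gensim.models.wrappers.LdaMallet` (as far as I know the wrappers were removed in gensim 4.0), and a working Mallet install.

```python
from pathlib import Path


def mallet_prefix(output_dir, name="mallet_"):
    """Build a prefix string inside output_dir, creating the directory if needed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The wrapper prepends this string to each file it writes
    return str(out / name)


def train_mallet(mallet_path, corpus, dictionary, output_dir, num_topics=20):
    """Train LdaMallet so its working files land in output_dir instead of the temp dir."""
    # Imported lazily so this module loads even without gensim 3.x installed
    from gensim.models.wrappers import LdaMallet

    return LdaMallet(
        mallet_path,
        corpus=corpus,
        num_topics=num_topics,
        id2word=dictionary,
        prefix=mallet_prefix(output_dir),
    )
```

Passing the same output directory used for the corpus, dictionary, and filenames would keep everything for a run in one place.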