avkoehl closed this 3 years ago
The first set of changes simplified the inputs and outputs of the topicmodel and embeddings classes, removed the 'corpus' classes, and moved the code towards pandas dataframes.
Future changes that would be nice to optimize performance:

- use 'modin' #61
- move text out of the dataframe (will be much more memory efficient) #51
- use parquet instead of csv for reading and writing #50
- fix the gensim bow creator so it is actually faster when parallelized #53
Overall it worked okay. Some things to note: a full run (get data, preprocess, model, analyze, x2) takes about 2 days.
- corpus -> dtm: about 2.5 hrs
- topic model with 12 cores: 2.5 hrs, uses 100 GB RAM
- topic model with 4 cores: 7 hrs
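For reference, the two topic model timings above imply near-linear scaling with core count. A quick back-of-the-envelope check, assuming both runs used the same data and settings (variable names are just for illustration):

```python
# Topic model wall-clock times reported above.
hours_4_cores = 7.0
hours_12_cores = 2.5

# Speedup from tripling the core count, and how close that is to ideal.
speedup = hours_4_cores / hours_12_cores   # 2.8x faster
core_ratio = 12 / 4                        # 3x the cores
efficiency = speedup / core_ratio          # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```

So going from 4 to 12 cores recovers roughly 93% of the ideal 3x speedup, which suggests the topic model step itself parallelizes well.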
- embedding: about 8 hours (I think), uses 145 GB RAM
- embedding parsing: 4 hours or so
- word frequency: at least 4 hours
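As a rough sanity check on the "about 2 days" figure, the stage timings can be added up. This is only a sketch: it assumes the stages run one after another, that "x2" means the whole sequence runs twice, and that the topic model uses 12 cores. Data download and preprocessing were not timed, and several of the numbers are estimates or lower bounds.

```python
# Approximate per-stage wall-clock hours from the notes above.
stage_hours = {
    "corpus -> dtm": 2.5,
    "topic model (12 cores)": 2.5,
    "embedding": 8.0,           # "I think"
    "embedding parsing": 4.0,   # "or so"
    "word frequency": 4.0,      # "at least"
}

one_pass = sum(stage_hours.values())   # hours for one sequential pass
two_passes = 2 * one_pass              # assumption: "x2" means two full passes
days = two_passes / 24

print(f"one pass: {one_pass} h, two passes: {two_passes} h (~{days:.2f} days)")
```

Under those assumptions the timed stages come to 21 hours per pass and 42 hours for two passes, about 1.75 days, which lines up with the overall estimate once the untimed steps are included.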