run on full 60K documents corpus to find performance pain points

datalab-dev / quintessence_analysis

All the scripts we use for analysis


run on full 60K documents corpus to find performance pain points #42

Closed avkoehl closed 3 years ago

avkoehl commented 3 years ago

Overall it worked okay. Some things to note: the full run (get data, preprocess, model, analyze x2) takes about 2 days.

corpus -> dtm: about 2.5 hrs
topic model with 12 cores: 2.5 hrs, uses 100 GB RAM
topic model with 4 cores: 7 hrs

embedding: about 8 hours (I think), uses 145 GB RAM
embedding parsing: 4 hours or so
word frequency: at least 4 hours
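For reference, per-stage numbers like the ones above could be collected with a small timing helper. This is only a sketch using the standard library; the `profiled` helper and the stage name `toy_dtm` are hypothetical and not part of this repo. Note that `tracemalloc` only sees Python-level allocations and slows the code down, so the peak it reports is a lower bound on real RAM use.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def profiled(stage, report):
    """Record wall-clock time and peak traced memory for one stage.

    Hypothetical helper. Do not nest these blocks: tracemalloc is
    started and stopped globally on each use.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        report[stage] = {"seconds": elapsed, "peak_mib": peak / 2**20}


# Toy stand-in for a real pipeline stage (assumption, not the real code).
report = {}
with profiled("toy_dtm", report):
    token_counts = [len(doc.split()) for doc in ["a b c"] * 1000]

for stage, stats in report.items():
    print(f"{stage}: {stats['seconds']:.3f}s, peak {stats['peak_mib']:.2f} MiB")
```

Wrapping each stage (get data, preprocess, model, analyze) in its own block would give a table comparable to the timings listed here.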

avkoehl commented 3 years ago

The first set of changes was to simplify the inputs and outputs of the topicmodel and embeddings classes, get rid of the 'corpus' classes, and move towards pandas dataframes.

avkoehl commented 3 years ago

Future changes that would be nice to optimize performance:
use 'modin' #61
move text out of dataframe (will be much more memory efficient) #51
use parquet instead of csv for reading and writing #50
fix gensim bow creator to actually be faster when parallelized #53