BabakHemmatian / Gay_Marriage_Corpus_Study

LDA and RNN for Reddit comments

Parallelize parsing and sampling #8

Closed by sabjoslo 5 years ago

sabjoslo commented 6 years ago

@BabakHemmatian, is this issue to parallelize the sampling of random comments (i.e. what's executed by Parser().select_random_comments())?

BabakHemmatian commented 6 years ago

Back when we created this issue, we didn't have a file to which the original comments were written. So the final function that was supposed to pick out the original text of the top comments for top topics had to iterate through the entire database in a serial manner, which would have taken around 2 days. I feel like with the new parallelized parser and the original relevant comments having already been written to file, parallelizing the sampled comment writer is not a priority at all.

As a side note, I think a threaded version can be more efficient for the parser than a multi-processor version. The manipulations we do on the data for parsing are not very expensive, so we could have 10 or so threads running concurrently on one CPU, while the possible gain from multiprocessing on a 4-core system is limited to 3x. What do you think? Regardless, this is a minor optimization and we have other things we can focus on.

sabjoslo commented 6 years ago

> As a side note, I think a threaded version can be more efficient for the parser than a multi-processor version.

That would be great, but doesn't Python's GIL prohibit threading?

BabakHemmatian commented 6 years ago

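The multiprocessing module can also run threads: multiprocessing.dummy (and multiprocessing.pool.ThreadPool) exposes the same API on top of the threading module. The GIL doesn't prohibit threading; it only stops threads from executing Python bytecode in parallel. CPython releases it during blocking I/O, and C extension modules can release it as well. Threads are slower for pure computation, but since most of the processing time for our parsing seems to be I/O, it might still help.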
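
For reference, here is a minimal sketch of what a thread-based parse could look like with `multiprocessing.pool.ThreadPool`. It is not the repository's `Parser`: the function names (`parse_file`, `parse_all`, `write_demo_files`), the newline-delimited JSON layout, and the `body` field are all assumptions made for illustration. Whether threads actually beat processes here would need to be timed on the real data.

```python
import json
import os
import tempfile
from multiprocessing.pool import ThreadPool


def parse_file(path):
    """Read one newline-delimited JSON file and return its comment bodies.

    The file layout and the "body" key are assumptions for this sketch.
    """
    bodies = []
    # File reads block on I/O, which is where CPython releases the GIL,
    # so other threads can make progress in the meantime.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                bodies.append(json.loads(line)["body"])
    return bodies


def parse_all(paths, n_threads=10):
    """Parse files in a pool of threads; results follow the order of `paths`."""
    with ThreadPool(n_threads) as pool:
        return pool.map(parse_file, paths)


def write_demo_files(directory, n_files=3, n_comments=2):
    """Write small hypothetical input files so the sketch can be run as-is."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, "comments_{}.json".format(i))
        with open(path, "w", encoding="utf-8") as handle:
            for j in range(n_comments):
                record = {"body": "comment {}-{}".format(i, j)}
                handle.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        demo_paths = write_demo_files(demo_dir)
        for path, bodies in zip(demo_paths, parse_all(demo_paths, n_threads=4)):
            print(os.path.basename(path), len(bodies))
```

Swapping `ThreadPool` for `multiprocessing.Pool` keeps the same `map` call, so the two approaches can be timed against each other without restructuring the code.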