Closed: sabjoslo closed this issue 5 years ago.
Back when we created this issue, we didn't have a file to which the original comments were written. So the final function that was supposed to pick out the original text of the top comments for the top topics had to iterate serially through the entire database, which would have taken around two days. With the new parallelized parser, and with the relevant original comments already written to file, I don't think parallelizing the sampled comment writer is a priority at all.
As a side note, I think a threaded version of the parser could be more efficient than a multiprocess version. The manipulations we do on the data for parsing are not very expensive, and we could have 10 or so threads running simultaneously on one CPU, whereas the possible gain from multiprocessing on a 4-core system is limited to roughly 3x. What do you think? Regardless, this is a minor optimization point and we have other things to focus on.
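As a rough illustration of the threaded approach suggested above, here is a minimal sketch using `concurrent.futures.ThreadPoolExecutor` with about 10 worker threads. `parse_comment` is a hypothetical stand-in for the real per-comment parsing routine, which in practice would also do the disk I/O that makes threading worthwhile:

```python
# Minimal sketch of running per-comment parsing on a thread pool.
# parse_comment is a hypothetical placeholder for the real parsing logic.
from concurrent.futures import ThreadPoolExecutor

def parse_comment(raw):
    # Placeholder for the real parsing step (cleaning, tokenizing, etc.);
    # the actual parser would also read from / write to disk here.
    return raw.lower().split()

comments = ["First comment", "Second COMMENT", "a third one"]

# ~10 threads on one CPU, as suggested above; during blocking I/O the GIL
# is released, so threads can overlap on I/O-heavy work.
with ThreadPoolExecutor(max_workers=10) as pool:
    parsed = list(pool.map(parse_comment, comments))
```

Whether this beats a process pool depends on how much of each comment's processing is actually I/O rather than CPU work.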
> As a side note, I think a threaded version can be more efficient for the parser than a multi-processor version.
That would be great, but doesn't Python's GIL prohibit threading?
The `multiprocessing` module can thread (based on the `threading` module), using specific CPython C extension modules that properly release the GIL to run in parallel. It is slower computation-wise, but since most of the processing time for our parsing seems to be I/O, it might still help.
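A concrete way to get threads behind the `multiprocessing` API, as described above, is `multiprocessing.dummy`, which exposes the same `Pool` interface but is backed by `threading`. The sketch below is a hedged example; `fetch_and_parse` is a hypothetical stand-in for an I/O-bound step in our parser:

```python
# multiprocessing.dummy provides a thread-backed Pool with the same API as
# multiprocessing.Pool, so switching between processes and threads is a
# one-line import change.
from multiprocessing.dummy import Pool as ThreadPool
# from multiprocessing import Pool  # process-based alternative

def fetch_and_parse(comment_id):
    # Hypothetical I/O-bound step: the real parser would read the comment
    # from the database or file before parsing it.
    return "parsed-%d" % comment_id

with ThreadPool(10) as pool:
    results = pool.map(fetch_and_parse, range(5))
```

Because the workers are threads, the mapped function does not need to be picklable, which also makes this easier to drop into existing code than a process pool.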
@BabakHemmatian, is this issue about parallelizing the sampling of random comments (i.e. what's executed by `Parser().select_random_comments()`)?