ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Break up results to save memory #49

Open ajenhl opened 8 years ago

ajenhl commented 8 years ago

Some operations are memory hogs when operating on large results. For example, diff-reduce starts with one pandas DataFrame holding all of the results, and gradually builds up another DataFrame with a subset of those results. When the CSV file for the full results is several GB in size, this ends up using a lot of RAM.

It is probably worth breaking the results into chunks where possible, and writing out to disk. So, for example, diff-reduce could append each processed group of results to a file as CSV rather than keeping them in memory.
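
As a rough illustration of the append-to-CSV idea, here is a minimal sketch using pandas' chunked CSV reader. It is not tacl's actual code: `process_in_chunks`, `process_chunk`, and the chunk size are hypothetical names and values chosen for the example. It also assumes each chunk can be processed independently; an operation like diff-reduce that works on groups of rows would need extra handling for groups that straddle a chunk boundary.

```python
import os

import pandas as pd


def process_in_chunks(input_path, output_path, process_chunk, chunksize=100_000):
    """Read a results CSV piece by piece and append processed rows to disk.

    process_chunk is a hypothetical callable that takes a DataFrame and
    returns the (possibly smaller) DataFrame of rows to keep.
    """
    # Start from a clean file so appended rows do not mix with old output.
    if os.path.exists(output_path):
        os.remove(output_path)

    write_header = True
    # With chunksize set, read_csv yields DataFrames of at most that many
    # rows, so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(input_path, chunksize=chunksize):
        reduced = process_chunk(chunk)
        if reduced.empty:
            continue
        # Append to the output file; write the header row only once.
        reduced.to_csv(output_path, mode="a", header=write_header, index=False)
        write_header = False
```

Peak memory is then bounded by the chunk size rather than by the size of the full results file. Note that if every chunk reduces to nothing, no output file is written at all, so a caller may want to handle that case.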