Closed jcrist closed 9 years ago
I suspect that there is some expansion/bloat when going from disk to a pandas series with text. The dask solution is worse only because it does this in parallel and so amplifies the bloat. I suspect that the problem is somewhere within `msgpack` or `pandas`. I think that the next step to identify the issue would be to run a memory profiler over a single run of the `load_partition` function. We might have to expand some nested function calls into individual lines to find the culprit.
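A line-by-line tool like `memory_profiler` would be the natural choice; as a minimal stdlib-only sketch (the `load_partition` body here is a stand-in, not castra's real loader), `tracemalloc` can at least report the peak allocation of a single call:

```python
import tracemalloc

def load_partition(n=100_000):
    # Stand-in for the real loader: build a list of Python strings,
    # mimicking text deserialization.
    return ["row-%d" % i for i in range(n)]

tracemalloc.start()
data = load_partition()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("peak bytes:", peak)
```

Running this around one `load_partition` call (and then around its inner calls, one by one) would narrow down where the bloat happens.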
Just tried this with a larger set of numeric data, which resulted in the same problem.
That shifts blame away from msgpack and possibly onto bloscpack. Pandas is still in the running. @esc any chance that blosc(pack) uses a lot of memory while decompressing? This seems unlikely but I thought I'd ask.
Closed by #33.
Bloscpack is chunked in nature so it can be nice to memory. Dealing with object arrays is more tricky since we are currently serializing the whole thing and then giving it to blosc IIRC. This may be bad for memory.
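To illustrate why chunking is kind to memory, here is a stdlib analogy using `zlib` (blosc itself is not used here): decompressing chunk by chunk keeps only one chunk's output as extra live memory at a time, rather than a second full copy of the data.

```python
import zlib

data = b"x" * 10_000_000
chunk_size = 1_000_000

# Compress in independent chunks, as a chunked container does.
chunks = [zlib.compress(data[i:i + chunk_size])
          for i in range(0, len(data), chunk_size)]

# Streaming decompression: peak extra memory is roughly one chunk,
# plus the output buffer being assembled.
out = bytearray()
for c in chunks:
    out.extend(zlib.decompress(c))

assert bytes(out) == data
```

Serializing a whole object array in one piece before handing it to the compressor loses exactly this property.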
What was the problem in the end? The numpy constructor?
It was the numpy constructor amplified by dask. Numpy was using lots of memory constructing object arrays, and dask was running 8 of those constructions in parallel.
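For context on why the object-array constructor is the hot spot, here is a sketch (not castra's actual code path): an object-dtype array stores only pointers, and every element is a separate heap-allocated Python object created during construction, whereas a fixed-width dtype is one contiguous buffer.

```python
import numpy as np

texts = ["some text %d" % i for i in range(1000)]

# Object dtype: the array itself is just pointers; each element is a
# full Python str allocated on the heap during construction.
obj = np.array(texts, dtype=object)

# Fixed-width unicode dtype: one contiguous buffer, no per-element objects.
fixed = np.array(texts)

print(obj.itemsize)    # 8 on a 64-bit build: just a pointer per element
print(fixed.dtype)     # a fixed-width '<U...' dtype
```

The per-element Python objects are what make this branch expensive, and running eight such constructions at once multiplies the transient footprint accordingly.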
`castra.to_dask('col').compute()` is slower than `castra[:, 'col']`, and uses significantly more RAM. For my sample dataset, loading ~400 MB of text used a peak of 2 GB when loaded straight from castra, and a peak of 12 GB when loaded through dask. This was not seen when using `.compute(get=get_sync)`. I was unable to determine if this was specific to object dtype, as the numeric data loaded in ms (compared to several seconds for strings). An intermediate solution (as discussed with @mrocklin) might be to put a lock around the object branch in `unpack_file`, thus serializing these requests.
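The proposed workaround could look roughly like this (a sketch only; the real `unpack_file` lives in castra, and the signature and helper functions here are invented for illustration):

```python
import threading

# Module-level lock shared by all worker threads.
_object_lock = threading.Lock()

def unpack_file(path, dtype):
    if dtype == object:
        # Only one thread at a time may run the memory-hungry
        # object-array construction, capping peak memory at
        # roughly one partition's worth of bloat.
        with _object_lock:
            return _load_object_column(path)
    # Numeric branches stay fully parallel.
    return _load_numeric_column(path)

def _load_object_column(path):
    # Stand-in for the real deserialization.
    return ["text"] * 10

def _load_numeric_column(path):
    return list(range(10))
```

This trades parallelism on text columns for a bounded peak; numeric columns, which were fast anyway, keep their concurrency.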