blaze / castra

Partitioned storage system based on blosc. **No longer actively maintained.**
BSD 3-Clause "New" or "Revised" License

Loading a column in castra slower in parallel #32

Closed — jcrist closed this issue 9 years ago

jcrist commented 9 years ago

`castra.to_dask('col').compute()` is slower than `castra[:, 'col']`, and uses significantly more RAM. For my sample dataset, loading ~400 MB of text used a peak of 2 GB when loaded straight from castra, and a peak of 12 GB when loaded through dask. This was not seen when using `.compute(get=get_sync)`.

I was unable to determine whether this was specific to object dtype, as the numeric data loaded in milliseconds (compared to several seconds for strings). An intermediate solution (as discussed with @mrocklin) might be to put a lock around the object branch in `unpack_file`, thus serializing these requests.
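
A minimal benchmark sketch of the comparison above, not taken from castra itself: it measures peak memory for the direct read, the default threaded dask read, and the synchronous scheduler. The path `'mydata.castra'` and the column name `'col'` are placeholders, and `memory_profiler` is assumed to be installed; `dask.async.get_sync` is the synchronous scheduler in the dask of this era.

```python
from castra import Castra
from memory_profiler import memory_usage

c = Castra('mydata.castra')          # hypothetical existing castra directory

def direct():
    return c[:, 'col']               # straight castra slice

def threaded():
    return c.to_dask('col').compute()            # default threaded scheduler

def synchronous():
    from dask.async import get_sync  # single-threaded scheduler (old dask spelling)
    return c.to_dask('col').compute(get=get_sync)

for name, fn in [('direct', direct), ('threaded', threaded), ('sync', synchronous)]:
    peak = memory_usage((fn, (), {}), max_usage=True)   # peak resident memory, MiB
    print(name, 'peak MiB:', peak)
```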

mrocklin commented 9 years ago

I suspect that there is some expansion/bloat when going from disk to a pandas series with text. The dask solution is worse only because it does this in parallel and so amplifies the bloat. I suspect that the problem is somewhere within msgpack or pandas. I think the next step toward identifying the issue would be to run a memory profiler over a single run of the `load_partition` function. We might have to expand some nested function calls into individual lines to find the culprit.
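
A minimal sketch of that diagnostic, not from the issue itself: line-by-line memory profiling of a single, non-parallel column load using the `memory_profiler` package. The path `'mydata.castra'` and the column `'col'` are placeholders, and the wrapper below is a stand-in; in practice you would point `@profile` at castra's own `load_partition` and expand its nested calls into separate lines, as suggested above.

```python
from memory_profiler import profile
from castra import Castra

c = Castra('mydata.castra')    # hypothetical existing castra directory

@profile                       # prints per-line memory increments when called
def single_load():
    # stand-in for one load_partition call: read one text column directly,
    # outside of dask, so only a single load is measured
    return c[:, 'col']

if __name__ == '__main__':
    single_load()
```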

jcrist commented 9 years ago

Just tried this with a larger set of numeric data, which resulted in the same problem.

mrocklin commented 9 years ago

That shifts blame away from msgpack and possibly onto bloscpack. Pandas is still in the running. @esc any chance that blosc(pack) uses a lot of memory while decompressing? This seems unlikely but I thought I'd ask.

jcrist commented 9 years ago

Closed by #33.

esc commented 9 years ago

Bloscpack is chunked in nature, so it can be gentle on memory. Dealing with object arrays is trickier, since IIRC we currently serialize the whole thing and then hand it to blosc. That may be bad for memory.
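
A rough sketch of the two paths described above, not castra's actual code: a numeric buffer can be handed to blosc, which compresses it in fixed-size blocks internally, while an object column is first serialized into one big blob (a full extra in-memory copy) and only then compressed. The serialization format castra actually uses may differ; msgpack is used here because it is what the thread discusses.

```python
import blosc
import msgpack
import numpy as np

# numeric column: blosc works on the raw buffer block by block,
# so the extra memory needed during compression stays bounded
nums = np.random.rand(1000000)
packed_nums = blosc.pack_array(nums)

# object/text column: serialize the whole column into a single blob first,
# then compress that blob in one shot
texts = np.array(['row %d' % i for i in range(1000000)], dtype=object)
blob = msgpack.packb(texts.tolist())          # whole-column serialized copy in RAM
packed_texts = blosc.compress(blob, typesize=1)

print(len(packed_nums), len(packed_texts))
```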

esc commented 9 years ago

What was the problem in the end? The numpy constructor?

jcrist commented 9 years ago

It was the numpy constructor, amplified by dask. Numpy was using lots of memory constructing object arrays, and dask was running eight of those constructions in parallel.
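
An illustrative sketch of that amplification, not castra's code: building an object-dtype array from a list of strings holds the list, the new array, and constructor temporaries in memory at once, and running eight of these concurrently (as the default threaded dask scheduler did) stacks those peaks on top of each other. The sizes and worker count below are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def build_object_array(n=1000000):
    strings = ['row %d' % i for i in range(n)]   # stand-in for deserialized text
    return np.array(strings, dtype=object)       # the costly constructor step

# one partition at a time: a single memory peak
single = build_object_array()

# eight partitions at once: the peaks coexist across threads
with ThreadPoolExecutor(max_workers=8) as pool:
    eight = list(pool.map(lambda _: build_object_array(), range(8)))
```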