blaze / castra

Partitioned storage system based on blosc. **No longer actively maintained.**
BSD 3-Clause "New" or "Revised" License

Start using bloscpack for the text serialization too. #34

Closed by esc 9 years ago

esc commented 9 years ago

Still needs to be benchmarked for speed and memory performance. Also the blosc_args can probably be tweaked.
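A minimal sketch of what such a speed and memory benchmark could look like. It is not the castra or bloscpack code path: `zlib` from the standard library stands in for blosc, the `level` parameter stands in for a tunable entry in `blosc_args`, and the helper names (`serialize_text`, `deserialize_text`, `benchmark`) and the length-prefixed framing are made up for illustration.

```python
import time
import tracemalloc
import zlib


def serialize_text(strings, level=5):
    """Frame each string with a 4-byte length prefix, then compress the lot.

    zlib is a stand-in for blosc; `level` is the knob being tuned.
    """
    encoded = [s.encode("utf-8") for s in strings]
    payload = b"".join(len(b).to_bytes(4, "little") + b for b in encoded)
    return zlib.compress(payload, level)


def deserialize_text(blob):
    """Inverse of serialize_text: decompress and split on the length prefixes."""
    payload = zlib.decompress(blob)
    out, pos = [], 0
    while pos < len(payload):
        n = int.from_bytes(payload[pos:pos + 4], "little")
        pos += 4
        out.append(payload[pos:pos + n].decode("utf-8"))
        pos += n
    return out


def benchmark(strings, level):
    """Time one round trip and record peak Python-level memory use."""
    tracemalloc.start()
    start = time.perf_counter()
    blob = serialize_text(strings, level)
    restored = deserialize_text(blob)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert restored == strings  # the round trip must be lossless
    return {
        "level": level,
        "seconds": elapsed,
        "peak_bytes": peak,
        "compressed_bytes": len(blob),
    }


if __name__ == "__main__":
    # Synthetic strings of varying length; a real dataset would be more telling.
    sample = [("comment %d " % i) * (i % 7 + 1) for i in range(10000)]
    for level in (1, 5, 9):
        print(benchmark(sample, level))
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside a C compression library would need a process-level measurement instead.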

esc commented 9 years ago

And eventually support for object arrays should probably move to bloscpack itself.

esc commented 9 years ago

@mrocklin @jcrist do you have a benchmark handy for profiling text data storage with castra?

mrocklin commented 9 years ago

I have historically used actual datasets for this; I don't have anything artificial. @jcrist was recently working on the reddit data dumps. He might have something interesting to work with (if you're willing to download a bit of data).

jcrist commented 9 years ago

The reddit data provides a pretty good benchmark, a wide variety of string data.

This script will convert the comment data from here into a castra. The body column is composed of ~55 million strings of varying lengths. Note that the data file is ~5 GB compressed and 32 GB decompressed (there is no need to decompress it first; the script does that in a streaming fashion). Conversion took around 45 minutes on my computer.
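For readers without the script at hand, streaming decompression can be sketched roughly as follows. This is not the script referred to above. It assumes the dump is a bz2-compressed file with one JSON object per line, and the names `iter_records` and `iter_chunks` are hypothetical.

```python
import bz2
import json
from itertools import islice


def iter_records(path):
    """Yield one decoded JSON object per line.

    bz2.open decompresses incrementally, so the full decompressed
    file never has to exist on disk or in memory.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_chunks(records, chunksize):
    """Group an iterable of records into lists of at most `chunksize` items."""
    records = iter(records)
    while True:
        chunk = list(islice(records, chunksize))
        if not chunk:
            return
        yield chunk
```

Each chunk from `iter_chunks(iter_records(path), chunksize)` could then be converted to a DataFrame and appended to the store one partition at a time, which keeps memory use bounded by the chunk size.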

esc commented 9 years ago

I am closing this as I won't have time to finish it for now. Perhaps another time.