Closed by esc 9 years ago
And eventually support for object arrays should probably move to bloscpack itself.
@mrocklin @jcrist do you have a benchmark handy for profiling text data storage with castra?
I have historically used actual datasets for this; I don't have anything artificial. @jcrist was recently working on the reddit data dumps. He might have something interesting to work with (if you're willing to download a bit of data.)
The reddit data provides a pretty good benchmark: it contains a wide variety of string data.
This script will convert the comment data from here into a castra. The body column is composed of ~55 million strings of varying lengths. Note that the data file is ~5 GB compressed and 32 GB decompressed (there is no need to decompress it first; the script does that in a streaming fashion). Conversion took around 45 minutes on my computer.
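For anyone who wants to try the streaming approach described above, here is a rough sketch of that part only (this is not the original script). It assumes the dump is a bz2-compressed file with one JSON object per line; the function name `iter_record_chunks` and the `chunksize` parameter are hypothetical, chosen for illustration.

```python
import bz2
import json
from itertools import islice


def iter_record_chunks(path, chunksize=100_000):
    """Yield lists of decoded JSON records from a bz2-compressed,
    newline-delimited JSON file, without decompressing it to disk."""
    # bz2.open in text mode decompresses lazily as lines are read.
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        records = (json.loads(line) for line in f if line.strip())
        while True:
            # Pull at most `chunksize` records into memory at a time.
            chunk = list(islice(records, chunksize))
            if not chunk:
                break
            yield chunk
```

Each chunk could then be turned into a DataFrame and appended to the on-disk store, so only one chunk of decoded records is held in memory at a time.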
I am closing this as I won't have time to finish it for now. Perhaps another time.
Still needs to be benchmarked for speed and memory performance. Also the blosc_args can probably be tweaked.
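A minimal sketch of how the speed and memory measurement could be set up, using only the standard library. Note that `pack_strings` and `unpack_strings` are hypothetical stand-ins (pickle plus zlib), not the bloscpack code path from this PR; the real serialization functions would be swapped in. `tracemalloc` only sees allocations made through Python's allocator, so memory used inside a C compression library may be undercounted.

```python
import pickle
import time
import tracemalloc
import zlib


def measure(func, *args):
    """Call func(*args) once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def pack_strings(strings, level=5):
    # Stand-in for the real storage path: pickle the list, then compress it.
    return zlib.compress(pickle.dumps(strings, protocol=pickle.HIGHEST_PROTOCOL), level)


def unpack_strings(blob):
    # Inverse of pack_strings.
    return pickle.loads(zlib.decompress(blob))


if __name__ == "__main__":
    data = ["comment number %d " % i * (i % 7 + 1) for i in range(100_000)]
    blob, t_pack, mem_pack = measure(pack_strings, data)
    out, t_unpack, mem_unpack = measure(unpack_strings, blob)
    print("pack:   %.3f s, peak %.1f MB" % (t_pack, mem_pack / 1e6))
    print("unpack: %.3f s, peak %.1f MB" % (t_unpack, mem_unpack / 1e6))
```

Running the same harness over a few compression settings would give a first idea of which parameters are worth tweaking.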