Closed by jreadey 8 years ago
What chunk size should I use? Which dataset in NCEP3? Please be more specific. What would be the name of the bucket/directory to save to?
See issue #8 for a discussion on determining the chunksize.
The NCEP3 dataset is on the OSDC cluster here: s3://hdfdata/ncep3/.
Let's store different compression formats/chunk sizes like so:
s3://hdfdata/ncep3
@hyoklee Run the s3cmd ls s3://hdfdata/ncep3/
command. All those HDF5 files, or more precisely all the datasets in those files, should be chunked.
As for the chunk size, we obviously don't have a tool that can give us an ultimate answer. I created a notebook you can use to plug in particular HDF5 dataset sizes and get suggestions for chunk sizes. We can then decide which suggestion we like more.
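For reference, the kind of heuristic such a notebook can encode looks roughly like this (a sketch in the spirit of h5py's guess_chunk(), not its actual code; the constants and function name are illustrative):

```python
import math

# Cap the total chunk size at 1 MiB; halve the largest dimension until
# the chunk fits. This is a simplified stand-in for the real heuristic.
CHUNK_MAX = 1024 * 1024  # upper bound on chunk size in bytes

def suggest_chunks(shape, itemsize):
    """Return a chunk shape whose total byte size is at most CHUNK_MAX."""
    chunks = list(shape)
    while math.prod(chunks) * itemsize > CHUNK_MAX:
        # Halve the largest dimension, rounding up so it never hits zero.
        i = chunks.index(max(chunks))
        chunks[i] = (chunks[i] + 1) // 2
    return tuple(chunks)

# Example: a hypothetical 3-D variable (time x lat x lon) of 4-byte floats.
print(suggest_chunks((1464, 94, 192), 4))
```

Plugging in each dataset's shape and element size gives a candidate chunk shape we can compare against other suggestions.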
Do you want to re-chunk / re-compress all datasets? Does your summary script work on a particular dataset?
> Do you want to re-chunk / re-compress all datasets?
All datasets should be re-chunked. (Thanks, @jreadey!)
Do you want me to re-chunk/compress with an h5py script, or can I use the h5repack command-line tool, which understands filters?
Sounds like you think h5repack is a better tool? I have no preference.
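For one file, the h5repack invocation could look like this (input.h5 and output.h5 are placeholder names; the chunk shape and GZIP level match the bucket name used below):

```shell
# Re-chunk every dataset to 45x180 and apply gzip compression at level 9.
h5repack -l CHUNK=45x180 -f GZIP=9 input.h5 output.h5
```

Looping this over all files per year and uploading the results with s3cmd put would reproduce the layout of the new bucket.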
I started putting chunked versions (h5py-suggested size) into:
s3://hdfdata/ncep3_chunk_45_180_gzip_9/
Please check if the files are re-chunked in a way that meets your need.
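One way to spot-check a repacked file after pulling it down with s3cmd get (output.h5 is a placeholder name) is to dump only the storage metadata and look for the chunk layout and filters:

```shell
# -H prints object headers only; -p adds storage properties
# (chunk dimensions and the compression filters applied).
h5dump -H -p output.h5 | grep -A3 -E "CHUNKED|FILTERS"
```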
Looks like we're getting good compression. Less than half the original size.
How long did it take to compress all the files?
@hyoklee - can you document the work by checking in the script that did the compression? We can create a folder like "transform" or something to put in all the codes we use.
It took 20 min per year, so 1987-2008 will take 22 * 20 min = 440 min. I ran it overnight, and all files are re-chunked and compressed.
It took 6 hours to repack 7850 files using the Unidata heuristic.
(py34)ubuntu@test2:~$ s3cmd ls s3://hdfdata
DIR s3://hdfdata/cortad/
DIR s3://hdfdata/ncep3/
DIR s3://hdfdata/ncep3_chunk_22_46_gzip_9/
DIR s3://hdfdata/ncep3_chunk_45_180_gzip_9/
For details, see the wiki: https://github.com/HDFGroup/datacontainer/wiki/User-Guide
Chunkify NCEP3 dataset and save to object store.