HDFGroup / datacontainer

Data Container Study

Create chunked version of NCEP3 #11

Closed jreadey closed 8 years ago

jreadey commented 8 years ago

Chunkify NCEP3 dataset and save to object store.

hyoklee commented 8 years ago

What chunk size should I use? Which dataset in NCEP3? Please be more specific. Also, what bucket/directory should I save the output to?

jreadey commented 8 years ago

See issue #8 for a discussion of how to determine the chunk size.

The NCEP3 dataset is on the OSDC cluster here: s3://hdfdata/ncep3/.

Let's store the different compression formats/chunk sizes under separate prefixes alongside s3://hdfdata/ncep3/ (one prefix per chunk-shape/compression combination).

ghost commented 8 years ago

@hyoklee Run an s3cmd ls s3://hdfdata/ncep3/ command. All those HDF5 files, or more precisely all the datasets in those files, should be chunked.
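A quick way to verify how the datasets in one of those files are currently laid out is to open a local copy with h5py and print each dataset's chunk layout. A minimal sketch; the filename is just a placeholder:

```python
import h5py

def report_chunking(path):
    """Print shape, chunk layout, and compression for every dataset in an HDF5 file."""
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype,
                      "chunks:", obj.chunks,          # None means contiguous (not chunked)
                      "compression:", obj.compression)
        f.visititems(visit)

# Example: a local copy of one NCEP3 file (placeholder name)
report_chunking("ncep3_1987.h5")
```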

As for the chunk size, we obviously don't have a tool that can give us a definitive answer. I created a notebook you can use to plug in particular HDF5 dataset sizes and get chunk-size suggestions. We can then decide which suggestion we like best.
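The notebook itself isn't reproduced in this thread, but the flavor of heuristic it encodes can be sketched: aim for chunks of roughly 1 MiB and shrink the dataset dimensions until a chunk fits that budget. This is a hypothetical stand-in for the notebook, not its actual code, and the example dimensions below are made up:

```python
import numpy as np

def suggest_chunk_shape(shape, dtype, target_bytes=1024 * 1024):
    """Suggest a chunk shape by halving the largest dimension until the chunk
    fits within target_bytes (default ~1 MiB). Hypothetical heuristic, not the
    notebook's actual algorithm."""
    itemsize = np.dtype(dtype).itemsize
    chunk = list(shape)
    while np.prod(chunk) * itemsize > target_bytes:
        i = int(np.argmax(chunk))          # shrink the largest dimension first
        if chunk[i] == 1:
            break                          # cannot shrink any further
        chunk[i] = (chunk[i] + 1) // 2     # halve, rounding up
    return tuple(chunk)

# Example with made-up (time, lat, lon) dimensions of 32-bit floats
print(suggest_chunk_shape((365, 361, 576), "float32"))
```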

hyoklee commented 8 years ago

Do you want to re-chunk / re-compress all datasets? Does your summary script work on a particular dataset?

ghost commented 8 years ago

> Do you want to re-chunk / re-compress all datasets?

All datasets should be ~~rechecked~~ re-chunked. (Thanks, @jreadey!)

jreadey commented 8 years ago

re-chunked!

hyoklee commented 8 years ago

Do you want me to re-chunk/compress with an h5py script, or can I use the h5repack command-line tool, which understands filters?

ghost commented 8 years ago

Sounds like you think h5repack is a better tool? I have no preference.
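For reference, the h5repack route can apply a chunk layout and a gzip filter to every dataset in one pass via its -l and -f options. A sketch, driven from Python so it can be scripted over many files; the chunk shape, gzip level, and file names are placeholders taken from later in this thread, not necessarily the exact command used:

```python
import subprocess

# Re-chunk and gzip-compress every dataset in one file with h5repack.
# -l applies the chunk layout to all datasets, -f applies the gzip filter.
subprocess.check_call([
    "h5repack",
    "-l", "CHUNK=45x180",   # chunk shape (45 x 180 elements)
    "-f", "GZIP=9",         # gzip at maximum compression level
    "ncep3_1987.h5",                          # placeholder input file
    "ncep3_1987_chunk_45_180_gzip_9.h5",      # placeholder output file
])
```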

hyoklee commented 8 years ago

I started putting the chunked versions (h5py-suggested size) into:

s3://hdfdata/ncep3_chunk_45_180_gzip_9/

Please check whether the files are re-chunked in a way that meets your needs.

jreadey commented 8 years ago

Looks like we're getting good compression. Less than half the original size.

How long did it take to compress all the files?

@hyoklee - can you document the work by checking in the script that did the compression? We can create a folder like "transform" or something to hold all the code we use.
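Until the real script is checked in, here is a rough idea of what an h5py-based transform could look like: copy every dataset into a new file with the agreed chunk shape and gzip level 9. The file names and the (45, 180) chunk shape come from this thread; everything else (scalar-dataset handling, attribute copying, fallback chunking) is an assumption:

```python
import h5py

CHUNKS = (45, 180)   # chunk shape discussed in this thread; assumed to suit the 2-D datasets

def rechunk_file(src_path, dst_path, chunks=CHUNKS, gzip_level=9):
    """Copy every dataset from src_path into dst_path, re-chunked and gzip-compressed."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        def copy(name, obj):
            if not isinstance(obj, h5py.Dataset):
                return
            if obj.ndim == 0:
                # Scalar datasets cannot be chunked or compressed
                out = dst.create_dataset(name, data=obj[()])
            else:
                fits = obj.ndim == len(chunks) and all(
                    c <= s for c, s in zip(chunks, obj.shape))
                out = dst.create_dataset(
                    name, data=obj[...],              # reads the whole dataset into memory
                    chunks=chunks if fits else True,  # True lets h5py pick a chunk shape
                    compression="gzip", compression_opts=gzip_level)
            for key, value in obj.attrs.items():      # carry dataset attributes across
                out.attrs[key] = value
        src.visititems(copy)

rechunk_file("ncep3_1987.h5", "ncep3_1987_chunk_45_180_gzip_9.h5")
```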

hyoklee commented 8 years ago

It took about 20 min per year, so 1987-2008 (22 years) works out to 22 * 20 min = 440 min. I ran it overnight, and all files are now re-chunked and compressed.

hyoklee commented 8 years ago

It took 6 hours to repack 7850 files using the Unidata chunking heuristic.

(py34)ubuntu@test2:~$ s3cmd ls s3://hdfdata
                       DIR   s3://hdfdata/cortad/
                       DIR   s3://hdfdata/ncep3/
                       DIR   s3://hdfdata/ncep3_chunk_22_46_gzip_9/
                       DIR   s3://hdfdata/ncep3_chunk_45_180_gzip_9/

For details, see the wiki: https://github.com/HDFGroup/datacontainer/wiki/User-Guide
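For completeness, the batch run over all 7850 files presumably amounts to fetching each object from the source prefix, repacking it, and uploading the result under the new prefix. A rough, hypothetical sketch of that loop; only the bucket prefixes and the chunk/gzip settings come from this thread, the rest is assumed:

```python
import subprocess
from os.path import basename

SRC_PREFIX = "s3://hdfdata/ncep3/"
DST_PREFIX = "s3://hdfdata/ncep3_chunk_45_180_gzip_9/"

def list_keys(prefix):
    """Return object URLs under an S3 prefix, parsed from s3cmd ls output."""
    out = subprocess.check_output(["s3cmd", "ls", prefix], universal_newlines=True)
    keys = []
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] != "DIR":   # skip sub-directory entries
            keys.append(parts[-1])
    return keys

for key in list_keys(SRC_PREFIX):
    local = basename(key)
    repacked = "repacked_" + local
    subprocess.check_call(["s3cmd", "get", "--force", key, local])
    subprocess.check_call(["h5repack", "-l", "CHUNK=45x180", "-f", "GZIP=9",
                           local, repacked])
    subprocess.check_call(["s3cmd", "put", repacked, DST_PREFIX + local])
```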