HDFGroup / datacontainer

Data Container Study

Chunking Algorithm #8

Open jreadey opened 8 years ago

jreadey commented 8 years ago

Is there a chunking algorithm that defines the optimal chunk layout?

ghost commented 8 years ago

This is the paper I found: Optimal Chunking of Large Multidimensional Arrays for Data Warehousing.

ghost commented 8 years ago

Made a module for different chunk size algorithms (commit: 87b503cb). The idea is to have them as importable functions.
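As an illustration of what one such importable function could look like, here is a minimal sketch. It is not the code from the commit above: the name `guess_chunk_shape`, the 1 MiB default target, and the "halve the largest dimension" heuristic are all assumptions chosen for the example.

```python
from math import prod


def guess_chunk_shape(shape, itemsize, target_bytes=1024 * 1024):
    """Return a chunk shape of at most target_bytes, where that is possible.

    Hypothetical heuristic: start from the full dataset shape and keep
    halving the currently largest dimension until the chunk fits.
    """
    if not shape:
        return None  # scalar datasets cannot be chunked
    chunk = [max(1, int(n)) for n in shape]
    while prod(chunk) * itemsize > target_bytes and any(n > 1 for n in chunk):
        # Pick the largest dimension (the first one on ties) and halve it,
        # rounding up so it never drops below 1.
        axis = max(range(len(chunk)), key=lambda i: chunk[i])
        chunk[axis] = (chunk[axis] + 1) // 2
    return tuple(chunk)


if __name__ == "__main__":
    # 1000 x 1000 array of 8-byte values, default 1 MiB target
    print(guess_chunk_shape((1000, 1000), 8))
```

Keeping each algorithm a pure function of shape and element size means it can be imported and compared against the others without opening a file.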

jreadey commented 8 years ago

How feasible would it be to extend this to chunk-ify a file?

ghost commented 8 years ago

I envisage using these functions to get first-guess chunk size advice and then applying it (or some other size) to a particular dataset in a file. What do you have in mind?

jreadey commented 8 years ago

I'm thinking of an "autochunk.py" script that would take a file as input, iterate through the datasets, get the best-guess chunk layout for each, and output a new file.

A useful option would be an interactive mode (the user confirms each chunk layout).
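A rough sketch of how such a script might be structured with h5py follows. No such script exists in this thread, so everything here is an assumption: the function names, the placeholder heuristic (cap each dimension at 64 elements), and the prompt format. It reads each dataset fully into memory and skips group attributes and empty datasets, so treat it as a starting point only.

```python
def placeholder_guess(shape, itemsize):
    """Hypothetical stand-in heuristic: cap every dimension at 64 elements."""
    if not shape:
        return None  # scalar datasets cannot be chunked
    return tuple(min(max(1, n), 64) for n in shape)


def confirm_layout(name, shape, chunks, ask=input):
    """Interactive mode: Enter accepts the proposal, or type e.g. 10,20,30."""
    reply = ask(f"{name} shape={shape} chunks={chunks} [Enter=accept]: ").strip()
    if not reply:
        return chunks
    return tuple(int(part) for part in reply.split(","))


def autochunk(src_path, dst_path, guess=placeholder_guess, interactive=False):
    """Copy every dataset in src_path to dst_path with a guessed chunk layout."""
    import h5py  # third-party; imported here so the helpers above work without it

    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:

        def copy_dataset(name, obj):
            if not isinstance(obj, h5py.Dataset):
                return None  # groups are created implicitly by create_dataset
            chunks = guess(obj.shape, obj.dtype.itemsize)
            if interactive and chunks is not None:
                chunks = confirm_layout(name, obj.shape, chunks)
            # Loads the whole dataset into memory; acceptable for a sketch only.
            new = dst.create_dataset(name, data=obj[...], chunks=chunks)
            for key, value in obj.attrs.items():
                new.attrs[key] = value
            return None  # returning None lets visititems continue

        src.visititems(copy_dataset)
```

Passing the heuristic in as the `guess` argument leaves room for choosing between chunking algorithms later; a call might look like `autochunk("in.h5", "out.h5", interactive=True)`, with both file names being placeholders.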

ghost commented 8 years ago

That's the next step. :-) And perhaps select which optimal chunk algo to use?

jreadey commented 8 years ago

From our results so far, chunking+compression is generally slower than no-chunking+no-compression for the ncep dataset. I suspect this is due to the decompression time. We should create a test dataset with chunking but no compression for comparison.
The cortad dataset (being one big 3D dataset) will likely provide a more interesting basis for comparison than the ncep dataset.
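To separate the cost of chunking from the cost of decompression, the three layouts could be written side by side and read back with a timer. The sketch below assumes h5py and gzip as the compression filter; the dataset names and helper functions are made up for the example, and read timings like these are sensitive to OS file caching.

```python
import time


def layout_variants(chunks):
    """create_dataset keyword arguments for the three layouts to compare."""
    return {
        "contiguous": {},                    # no chunking, no compression
        "chunked": {"chunks": chunks},       # chunking only
        "chunked_gzip": {"chunks": chunks, "compression": "gzip"},
    }


def write_variants(path, data, chunks):
    """Write the same array once per layout into a single HDF5 file."""
    import h5py  # third-party; imported lazily

    with h5py.File(path, "w") as f:
        for name, kwargs in layout_variants(chunks).items():
            f.create_dataset(name, data=data, **kwargs)


def time_full_read(path, name):
    """Return the seconds taken to read one dataset completely."""
    import h5py

    start = time.perf_counter()
    with h5py.File(path, "r") as f:
        f[name][...]
    return time.perf_counter() - start
```

If the chunked variant reads about as fast as the contiguous one while the gzip variant is slower, that would support the decompression explanation; if the chunked variant is already slow, the chunk shape itself is the more likely culprit.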

ghost commented 8 years ago

Perhaps the chosen chunk size is not all that optimal (good).