jreadey opened this issue 8 years ago
This is the paper I found: Optimal Chunking of Large Multidimensional Arrays for Data Warehousing.
I made a module with different chunk-size algorithms (commit: 87b503cb). The idea is to have them as importable functions.
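As a rough illustration of what one such importable function might look like (the name, signature, and 1 MiB target here are my assumptions, not the contents of commit 87b503cb):

```python
import math

def guess_chunk_shape(shape, itemsize, target_bytes=1024 * 1024):
    """Halve the largest dimension until the chunk fits in target_bytes.

    Hypothetical heuristic for illustration only; the real module may
    implement the algorithms from the cited paper instead.
    """
    chunk = list(shape)
    while math.prod(chunk) * itemsize > target_bytes and max(chunk) > 1:
        i = chunk.index(max(chunk))       # shrink the largest dim first
        chunk[i] = max(1, chunk[i] // 2)
    return tuple(chunk)

# e.g. a (1460, 720, 1440) float32 dataset -> a chunk near 1 MiB
print(guess_chunk_shape((1460, 720, 1440), 4))
```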
How feasible would it be to extend this to chunk-ify a file?
I envisage using these functions to get first-guess chunk size advice and then applying that (or some other size) to a particular dataset in a file. What do you have in mind?
I'm thinking of an "autochunk.py" script that would take a file as input, iterate through its datasets, compute a best-guess chunk layout for each, and write out a new file.
A useful option would be an interactive mode where the user confirms each chunk layout.
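A minimal sketch of what that could look like with h5py, assuming the hypothetical `guess_chunk_shape` helper from above; attribute copying, error handling, and the override path in interactive mode are elided:

```python
#!/usr/bin/env python
"""autochunk.py (sketch): rewrite every dataset in an HDF5 file with a
best-guess chunk layout, optionally confirming each layout interactively."""
import sys
import h5py
from chunk_utils import guess_chunk_shape  # hypothetical module, see above

def autochunk(src_path, dst_path, interactive=False):
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        def copy_item(name, obj):
            if not isinstance(obj, h5py.Dataset):
                dst.require_group(name)
                return
            chunks = guess_chunk_shape(obj.shape, obj.dtype.itemsize)
            if interactive:
                reply = input(f"{name}: use chunk layout {chunks}? [Y/n] ")
                if reply.strip().lower().startswith("n"):
                    chunks = None  # fall back to contiguous for this dataset
            # obj[...] reads the whole dataset into memory -- fine for a
            # sketch, not for datasets larger than RAM.
            dst.create_dataset(name, data=obj[...], chunks=chunks)
        src.visititems(copy_item)

if __name__ == "__main__":
    autochunk(sys.argv[1], sys.argv[2], interactive="-i" in sys.argv)
```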
That's the next step. :-) And perhaps select which optimal chunk algo to use?
From our results so far, chunking+compression is generally slower than no-chunking+no-compression for the ncep dataset. I suspect this is due to the decompression time.
We should create a test dataset with chunking but no compression for comparison.
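A minimal sketch of how that comparison set could be written with h5py; the array shape, chunk shape, and file names are placeholder assumptions. Reading the same hyperslabs from all three files should separate chunking overhead from decompression time:

```python
import numpy as np
import h5py

# Stand-in array; swap in the actual ncep/cortad data for real benchmarks.
data = np.random.rand(128, 256, 256).astype("f4")

with h5py.File("contiguous.h5", "w") as f:
    f.create_dataset("d", data=data)                       # baseline
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset("d", data=data, chunks=(16, 64, 64))  # chunked only
with h5py.File("chunked_gzip.h5", "w") as f:
    f.create_dataset("d", data=data,
                     chunks=(16, 64, 64), compression="gzip")
```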
The cortad dataset (being one big 3D dataset) will likely provide a more interesting basis for comparison than the ncep dataset.
Perhaps the chosen chunk size is not actually that close to optimal.
Is there a chunking algorithm that defines the optimal chunk layout?
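Worth noting that "optimal" is usually defined relative to an expected access pattern, which is roughly how the cited paper frames it, as I recall. A toy brute-force illustration of that idea (not the paper's algorithm): count the chunks each candidate layout forces a given query mix to read, and keep the cheapest.

```python
import itertools, math

def chunks_touched(chunk, query):
    """Chunks a hyperslab query (one (start, stop) pair per dim) must read."""
    n = 1
    for c, (start, stop) in zip(chunk, query):
        n *= (stop - 1) // c - start // c + 1
    return n

queries = [[(0, 1024), (0, 8)],   # tall, narrow read (column-oriented)
           [(0, 8), (0, 1024)]]   # short, wide read (row-oriented)
# Cap candidates at 16384 elements to mimic a chunk-size budget.
candidates = [c for c in itertools.product([8, 32, 128, 512], repeat=2)
              if math.prod(c) <= 16384]
best = min(candidates,
           key=lambda c: sum(chunks_touched(c, q) for q in queries))
print("cheapest layout for this workload:", best)  # (128, 128) here
```

For this mixed row/column workload the balanced layout wins, which is why a single "optimal" answer doesn't exist without knowing how the data will be read.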