LifeboatLLC / SparseHDF5

Sparse storage in HDF5

Thoughts on sparse data proposal #1

Open markcmiller86 opened 1 year ago

markcmiller86 commented 1 year ago

I am not entirely certain I understand the requirements being addressed by this work. I read the requirements section of this paper. First, if this statement is true, "Further assume that for each image, it is possible to identify automatically either (see figure 1):", then why isn't it possible to simply store the image to HDF5 using a good compression filter and then use whatever tool does the "automatic identification of ROIs and clusters" to reconstruct that information on demand when needed? In other words, if that automatic identification is available, just store the original (dense) image (again using a good, possibly domain-specific compression filter) to HDF5.
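For concreteness, here is a minimal sketch (C API) of that alternative: the dense image written as an ordinary chunked dataset, with the built-in deflate filter standing in for whatever good, domain-specific filter would actually be used. The function name, dataset name, element type, and chunk size are all illustrative.

```c
#include "hdf5.h"

/* Store a dense 2-D image as a chunked, compressed HDF5 dataset.
 * Chunk size is arbitrary here and must not exceed the dataset dimensions. */
int write_dense_image(hid_t file_id, const unsigned short *image,
                      hsize_t nrows, hsize_t ncols)
{
    hsize_t dims[2]  = {nrows, ncols};
    hsize_t chunk[2] = {256, 256};

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);   /* swap in a domain-specific filter if one exists */

    hid_t dset = H5Dcreate2(file_id, "image", H5T_NATIVE_USHORT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    herr_t status = H5Dwrite(dset, H5T_NATIVE_USHORT, H5S_ALL, H5S_ALL,
                             H5P_DEFAULT, image);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    return (int)status;
}
```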

Next, this basic problem is IMHO not very different from the relationship between structured meshes (the dense case) and UNstructured meshes. In the UNstructured case, you have explicit topology (e.g., element-to-element connectivities) you need to maintain. The answer is that it requires multiple datasets in HDF5 to handle properly, much like https://github.com/appier/h5sparse. So, I am perplexed as to why we are trying to take a perfectly well-defined interface for the dense case and shoehorn sparse underneath it, as opposed to simply defining a convention for the sparse case out of multiple datasets. The article above says...

"While this approach maintains the level of abstraction for library users, it is abstraction defeating to the rest of the ecosystem. Substantial changes to existing tools and applications would be required."

First, any existing tool in the ecosystem that might have a shot at doing something with this data will do so only under the condition that it can treat it like a normal, dense 3D array. In other words, none of the existing ecosystem you mention will have any knowledge of, or desire to treat, the data in terms of ROIs or clusters. That new way of looking at the data is known only to tools that will be imbued with the knowledge to handle it.

Next, this basic problem is why products like Silo and Exodus exist... not everything is a regular array. These products were designed to create the needed abstractions by using conventions in the use of HDF5 "primitives".
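To make the multiple-dataset idea concrete, here is a sketch of one such convention: a CSR sparse matrix stored as a group of three 1-D datasets plus a shape attribute, loosely in the spirit of h5sparse. The group, dataset, and attribute names are illustrative, not any existing schema.

```c
#include "hdf5.h"

/* Helper: write a 1-D dataset of n elements into a group. */
static void write_1d(hid_t grp, const char *name, hid_t type,
                     const void *buf, hsize_t n)
{
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate2(grp, name, type, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    H5Dclose(dset);
    H5Sclose(space);
}

/* Store a CSR matrix as /sparse_matrix/{data,indices,indptr} plus a
 * "shape" attribute recording the logical (dense) dimensions. */
int write_csr(hid_t file_id, const double *data, const long long *indices,
              const long long *indptr, hsize_t nnz, hsize_t nrows, hsize_t ncols)
{
    hid_t grp = H5Gcreate2(file_id, "sparse_matrix", H5P_DEFAULT,
                           H5P_DEFAULT, H5P_DEFAULT);

    write_1d(grp, "data",    H5T_NATIVE_DOUBLE, data,    nnz);
    write_1d(grp, "indices", H5T_NATIVE_LLONG,  indices, nnz);
    write_1d(grp, "indptr",  H5T_NATIVE_LLONG,  indptr,  nrows + 1);

    /* Record the logical (dense) shape so a reader can reconstruct the matrix. */
    hsize_t two = 2;
    hsize_t shape[2] = {nrows, ncols};
    hid_t aspace = H5Screate_simple(1, &two, NULL);
    hid_t attr   = H5Acreate2(grp, "shape", H5T_NATIVE_HSIZE, aspace,
                              H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_HSIZE, shape);

    H5Aclose(attr);
    H5Sclose(aspace);
    H5Gclose(grp);
    return 0;
}
```

Nothing here is outside the existing data model; the sparse semantics live entirely in the convention.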

What is being proposed is a significant departure from the existing data model, and it is not well motivated. This is especially true when reasonable alternatives (as described in the paper) exist. I think there should be some kind of time/space performance study in which you compare unadulterated HDF5, perhaps with a highly tuned compression filter created specifically for this purpose, against the high-level library option (both of which represent very small investments of work... probably a couple of weeks at most) and get a better handle on what the time/space performance tradeoffs are going to be.

This remark in README.md, "Also, storing and accessing sparse datasets as dense datasets, when read into memory (and after decompression), may result in a huge memory footprint.", describes something any library that supports memory-resident compressed data (e.g., ZFP) also needs, and the solution proposed so far by HDF5 for those cases is to use the direct chunk write methods.
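For reference, a minimal sketch of that direct-write route, assuming the application has already produced a compressed chunk buffer in memory (e.g., with ZFP) and knows the chunk's logical offset; the helper name and arguments are illustrative.

```c
#include "hdf5.h"

/* Write a pre-compressed chunk straight into a chunked dataset, bypassing
 * the HDF5 filter pipeline. Reading the data back through H5Dread requires
 * the matching filter in the dataset's pipeline. */
herr_t write_precompressed_chunk(hid_t dset_id,
                                 const void *comp_buf, size_t comp_nbytes,
                                 const hsize_t offset[] /* chunk origin */)
{
    /* A filter mask of 0 records that all pipeline filters were applied;
     * the library stores comp_buf as-is. */
    uint32_t filter_mask = 0;
    return H5Dwrite_chunk(dset_id, H5P_DEFAULT, filter_mask,
                          offset, comp_nbytes, comp_buf);
}
```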

Also wanted to capture these refs in case they are useful...

epourmal commented 1 year ago

@markcmiller86

Mark,

Thank you for sharing your thoughts, and my sincere apologies for the delay in documenting our discussion.

"In other words, if that automatic identification is available, just store the original (dense) image (again using a good, possibly domain-specific compression filter) to HDF5."

You are absolutely correct that compression can be used to save storage space and that the high-level libraries offer solutions, but the proposal is not about saving storage space. It is about data portability and new opportunities to solve existing problems.

Automatic detection of ROIs is available only on write, i.e., after the data is written, the locations of the written elements are unknown unless the application saves them in the file. Our proposal addresses exactly this problem: the locations are stored with the data in a portable, self-describing way. Of course, there are many applications (and you provided the examples, e.g., Silo, Exodus, h5sparse) that have already addressed the issue of storing sparse data. But the files created by those applications are not portable, in the sense that a reader needs to know the schema used for storing the sparse data.

The proposed structured chunk storage allows us to implement not only sparse data storage but also to address the storage of variable-length (VL) data in HDF5. We will be able to compress VL data and write it in parallel, which is not possible now. (When VL data is stored as a compressed dataset, what is actually compressed is a dataset of "encoded pointers" to the VL data stored in heaps, and the heaps themselves are not compressed.)
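To illustrate the current behavior, here is a sketch that creates a chunked VL dataset with a deflate filter. What passes through the filter is the stream of encoded references to the global heap; the heap objects holding the actual variable-length values are not compressed. The dataset name and sizes are illustrative.

```c
#include "hdf5.h"

/* Write a 1-D dataset of variable-length integer sequences with deflate.
 * Only the encoded heap references are compressed, not the VL payload. */
int write_vl_ints(hid_t file_id, hvl_t *records, hsize_t nrecords)
{
    hid_t vl_type = H5Tvlen_create(H5T_NATIVE_INT);

    hsize_t dims[1]  = {nrecords};
    hsize_t chunk[1] = {nrecords};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(file_id, "vl_data", vl_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    herr_t status = H5Dwrite(dset, vl_type, H5S_ALL, H5S_ALL,
                             H5P_DEFAULT, records);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(vl_type);
    return (int)status;
}
```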

Please notice that the programming model doesn't change. Existing APIs can be used to manage sparse and variable-length data. New APIs provide specific functionality for the new types of data (e.g., find the locations of stored elements, erase existing data, write/read structured chunks). Current applications will work without changes when reading data stored using structured chunks.

I will rework the RFCs to provide better motivation and to emphasize the problem the proposal addresses.

Thank you once more for your comments. I will leave the issue open until I have the new versions of the RFCs. I hope to get more comments from the community.