asdf-format / asdf-standard

Standards document describing ASDF, Advanced Scientific Data Format
http://asdf-standard.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Arrays that span multiple blocks #265

Open · eslavich opened this issue 4 years ago

eslavich commented 4 years ago

This idea comes from a comment in the readme of @eschnett's asdf-cxx project:

It would be interesting to be able to split arrays into multiple blocks. This would allow tiled representations (which can be much faster for partial reading), and would allow not storing large masked regions.

perrygreenfield commented 4 years ago

Isn't this basically supporting chunking?

eschnett commented 4 years ago

Yes it is.

perrygreenfield commented 4 years ago

I guess the question becomes whether the chunks should be in separate blocks. Yes, if they are compressed; but is that a common usage? I worry that a large number of chunks listed in the YAML will affect performance. I should see how zarr does it as an example, unless you already know.

eschnett commented 4 years ago

I have not thought about the implementation much. My impression is that the current block format is already a bit complex (but still straightforward and easy to encode/decode), and adding a chunking layer on top of this would just increase complexity. My idea (at the time) was to define my own chunking mechanism (outside of the ASDF standard) on top of these blocks, and this would be done in YAML. A "chunked block" definition would describe the overall array shape/size, the chunking granularity, and have pointers to blocks.

Maybe chunked arrays could be stored as arrays whose entries are chunk ids? These arrays could then be stored either as YAML or as blocks.
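As a rough sketch of what such a YAML-level description might look like (purely illustrative; none of these keys exist in the ASDF standard or in any proposed schema):

```yaml
# Hypothetical layout only, not a real ASDF tag or schema.
chunked_array:
  shape: [4096, 4096]        # overall array shape
  dtype: float64
  chunk_shape: [1024, 1024]  # chunking granularity
  # Pointers to the blocks holding each chunk, in row-major chunk order;
  # a null entry could mean "chunk not stored" (e.g. a fully masked region).
  chunks:
    - source: 0
    - source: 1
    - source: null
    - source: 2
```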

Compression (or checksumming) seems like a good feature to have. zlib and bzip2 by themselves don't compress much, but there are compression algorithms that are adapted to the floating-point representation and which perform much better.
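As a rough illustration of why representation-aware transforms matter (this is not a specific codec being proposed): a simple byte-shuffle, which groups bytes of equal significance together the way float-oriented compressors do internally, can already change the zlib ratio noticeably on smooth data.

```python
import zlib
import numpy as np

# Smooth synthetic float data; real science arrays will behave differently.
data = np.linspace(0.0, 1.0, 1_000_000, dtype=np.float64)

raw = data.tobytes()
# Byte-shuffle: regroup the 8 bytes of each float64 so that byte 0 of every
# element comes first, then byte 1 of every element, and so on.
shuffled = data.view(np.uint8).reshape(-1, 8).T.copy().tobytes()

print("plain zlib ratio:   ", len(zlib.compress(raw, 6)) / len(raw))
print("shuffled zlib ratio:", len(zlib.compress(shuffled, 6)) / len(raw))
```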

perrygreenfield commented 4 years ago

Given the fact that zarr supports compressed chunks, I don't think there is any sensible way of having the chunks within one or many ASDF binary blocks that would be efficient. Zarr handles it by putting each chunk in a separate file (or using a database to handle the chunks), and that appears to be the only practical solution given the unknowable size of chunks if they are writable. It would be possible to store all the chunks in an ASDF file so long as no compression is used, or the file is read-only (e.g., an archived file). But for a working file that can be updated, the zarr approach is the only practical one. I'm going to think a bit more about how we could leverage zarr efficiently. I don't think its approach precludes support in other languages.
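For reference, a minimal zarr usage sketch of the chunk-per-file layout described above (the store path, shapes, and chunk sizes are made up for illustration):

```python
import zarr
import numpy as np

# A chunked, compressed array backed by a directory store: each chunk is
# written as its own file/object inside the store, so a chunk can grow or
# shrink under compression without rewriting its neighbors.
z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(8192, 8192),
    chunks=(1024, 1024),
    dtype="f4",
)
z[:1024, :1024] = np.random.random((1024, 1024))  # touches only one chunk
```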

eschnett commented 4 years ago

The use case that interests me and many of my collaborators is immutable files. These files are written once; if they are processed, they are not changed, and additional files are created instead. That is, instead of treating a file as a mutable database, it is treated as a snapshot. Thus chunks that change in size or number (arrays that are rewritten or resized, or new arrays that are added) are not relevant. However, pointers between files are quite convenient to have.

I don't know how large a fraction of the community would use ASDF the same way.

perrygreenfield commented 4 years ago

I think both ways are useful. Immutable data could be supported as well by saving the data within the file. I'll see if I can come up with an outline for both approaches that can use zarr as the interface to the data.

perrygreenfield commented 4 years ago

@eschnett we've been discussing this quite a bit internally to come up with proposals for how to deal with these kinds of cases, and we are beginning to firm up our ideas. I'm going to post a proposal soon on how to handle extended kinds of compression, since chunking implementations will be layered on that.

PaulHuwe commented 4 years ago

This is a must-have for RST, given the large variables involved.

PaulHuwe commented 4 years ago

On a related performance note, block processing will be important, i.e., not loading full variables into memory for processing. I can make this a separate issue if desired.

eschnett commented 4 years ago

Block processing (i.e. traversing large arrays block-by-block, or traversing only a caller-specified subset of an array) is part of the API, not the data format. That is, isn't that a question for an implementation, not the standard?

perrygreenfield commented 4 years ago

It could be either: if using chunking, it would be a data format question; otherwise, an API question. That's one reason to understand which is being asked for. (I tried to raise this issue in a previous comment but it wouldn't let me at the time.) The API options are memory mapping or reading a range of a block.
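A minimal sketch of those two API-side options using plain numpy on a raw binary file (the filename, offset, and shapes are invented for illustration; this is not the asdf library's API):

```python
import numpy as np

# Suppose a large float32 array's uncompressed data starts at a known byte
# offset inside a file (as it would for an uncompressed ASDF block).
offset = 0          # hypothetical block-data offset
shape = (8192, 8192)

# Option 1: memory-map the block and slice it; only the pages backing the
# requested tile are faulted in, not the whole array.
arr = np.memmap("data.bin", dtype=np.float32, mode="r", offset=offset, shape=shape)
tile = np.array(arr[0:1024, 0:1024])

# Option 2: read an explicit byte range covering only the rows of interest.
row_bytes = shape[1] * np.dtype(np.float32).itemsize
with open("data.bin", "rb") as f:
    f.seek(offset)
    rows = np.frombuffer(f.read(1024 * row_bytes), dtype=np.float32).reshape(1024, shape[1])
```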

PaulHuwe commented 4 years ago

Yeah, I am advocating for both (though I can see that block processing in the API should be raised elsewhere; I only noted it here because it is somewhat related).

perrygreenfield commented 4 years ago

Both being chunking and the other options? Or just the last two? For RST I'd say chunking is more consistent with the cloud model.

PaulHuwe commented 4 years ago

Chunking & block processing. You are correct that chunking is more consistent with the cloud model, whereas both chunking and block processing are important for data manipulation outside of the cloud.