asdf-format / asdf

ASDF (Advanced Scientific Data Format) is a next generation interchange format for scientific data
http://asdf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Chunking support #1783

Open eschnett opened 5 months ago

eschnett commented 5 months ago

When a large ndarray is stored as a single compressed binary block, the block has to be read and decompressed from its beginning (up to the requested data, at least) even when only a small subarray is needed. "Chunking" remedies this: instead of storing an ndarray as a single binary block, it is stored as a set of smaller blocks that are compressed and stored independently.

Are there plans to support this? Can this be implemented as an extension?
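To make the difference concrete, here is a minimal sketch using only Python's standard `zlib` module. It does not use ASDF's block format; the chunk size and the helper names `compress_chunked` and `read_range` are made up for illustration.

```python
import zlib

CHUNK = 1024  # hypothetical chunk size in bytes


def compress_chunked(data: bytes, chunk: int = CHUNK) -> list[bytes]:
    """Compress each fixed-size slice of `data` independently."""
    return [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_range(chunks: list[bytes], start: int, stop: int, chunk: int = CHUNK) -> bytes:
    """Return data[start:stop], decompressing only the chunks that overlap it.

    Assumes 0 <= start and stop <= total length of the original data.
    """
    if start >= stop:
        return b""
    first, last = start // chunk, (stop - 1) // chunk
    # Only chunks first..last are touched; all others stay compressed.
    joined = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    offset = start - first * chunk
    return joined[offset:offset + (stop - start)]


data = bytes(range(256)) * 40          # 10240 bytes of sample data
chunks = compress_chunked(data)        # 10 independently compressed chunks
piece = read_range(chunks, 3000, 3100) # touches only chunks 2 and 3
```

With a single compressed block, the same read would have to decompress everything up to byte 3100.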

One simple approach would be to introduce a new YAML tag, core/chunked-ndarray, holding a YAML sequence of chunks, each of which pairs an offset with an ndarray. For example:

chunky: !core/chunked-ndarray-1.0.0
  - !core/ndarray-chunk-1.0.0
    offset: [0,0]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [100,0]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [0,100]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  # possibly more chunks here
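A reader of such a layout would need to work out which chunks a requested subarray touches. A rough sketch of that lookup follows, assuming each chunk's shape is known alongside its offset. The `Chunk` class, the `overlapping` helper, and the 100×100 chunk shape are hypothetical and not part of any ASDF schema.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    offset: tuple[int, ...]  # index of the chunk's first element in the full array
    shape: tuple[int, ...]   # extent of the chunk along each axis (assumed known)


def overlapping(chunks, start, stop):
    """Return the chunks that intersect the half-open region [start, stop) on every axis."""
    hits = []
    for c in chunks:
        if all(o < hi and lo < o + n
               for o, n, lo, hi in zip(c.offset, c.shape, start, stop)):
            hits.append(c)
    return hits


# The three chunks from the example above, assuming each is 100x100.
chunks = [
    Chunk((0, 0), (100, 100)),
    Chunk((100, 0), (100, 100)),
    Chunk((0, 100), (100, 100)),
]
needed = overlapping(chunks, start=(90, 10), stop=(110, 20))
```

Here only the chunks at offsets [0,0] and [100,0] would need to be read and decompressed.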

Has there been any work in this direction?

braingram commented 5 months ago

Thanks for opening this issue.

There has been some work adding support for the zarr storage format within ASDF. This is implemented via the asdf-zarr extension (https://github.com/asdf-format/asdf-zarr). It's a new package, so please let me know if you plan to use it "in production" so we can give it another review. Also feel free to give it a try and open issues if you find anything. The extension offers a few options:

The use of zarr also opens up a second place where compression can be controlled (which can get a bit confusing).

eschnett commented 5 months ago

@braingram Nice! We are currently discussing storage formats, and both ASDF and Zarr are contenders that have various advantages and disadvantages. On the surface, using Zarr chunking with ASDF single-file storage seems like an excellent choice. I will have a look.