LLNL / zfp

Compressed numerical arrays that support high-speed random access
http://zfp.llnl.gov
BSD 3-Clause "New" or "Revised" License

Why are strides ignored in zfp_field_metadata()? #230

Closed S-o-T closed 2 months ago

S-o-T commented 2 months ago

In zfp_field_metadata(), which is called during evaluation of zfp_write_header(), only the field's dimensions are stored, while strides are ignored. Is that a bug, or are strides intentionally left unhandled?
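A minimal sketch of what I mean (zfp 1.0-style API, untested):

```c
#include <assert.h>
#include <stddef.h>
#include "zfp.h"

int main(void)
{
  /* 512x512 single-precision field with x-stride 2 (interleaved component) */
  zfp_field* field = zfp_field_2d(NULL, zfp_type_float, 512, 512);
  zfp_field_set_stride_2d(field, 2, 2 * 512);

  /* packs scalar type and dimensions into 64 bits; strides do not fit */
  uint64 meta = zfp_field_metadata(field);

  zfp_field* copy = zfp_field_alloc();
  zfp_field_set_metadata(copy, meta);

  ptrdiff_t stride[2];
  assert(!zfp_field_stride(copy, stride)); /* strides did not round-trip */

  zfp_field_free(copy);
  zfp_field_free(field);
  return 0;
}
```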

lindstro commented 2 months ago

That is indeed intentional. Strides are a property of how the data is organized in memory as it is compressed and written; the compressed output is organized differently. Consumers do not necessarily want to maintain those strides when later reading and decompressing the data. Requiring that could, for instance, blow up memory requirements for the consumer if the original data is not stored contiguously. As an example, the original layout could be in array-of-struct form, perhaps with dozens of different fields being written one at a time by a simulation code. If later data analysis is to be done on a single field, you don't want to have to recreate the original layout, which would waste storage on all but that one field.

If the same data layout is desired during decompression, then that can be accomplished by setting strides during decompression, though the strides would have to be maintained separately.
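For example, the consumer side might look like this (a rough sketch against the zfp 1.0-style C API, untested; the strides are assumed to arrive out of band):

```c
#include <stddef.h>
#include "zfp.h"

/* Sketch (untested): decompress one scalar field into an externally
 * prescribed strided layout. The strides sx/sy must be maintained
 * outside the stream; zfp's header does not carry them. */
static size_t
decompress_strided_2d(void* buffer, size_t bytes, /* compressed stream */
                      float* dst,                 /* destination base */
                      ptrdiff_t sx, ptrdiff_t sy) /* desired strides */
{
  bitstream* stream = stream_open(buffer, bytes);
  zfp_stream* zfp = zfp_stream_open(stream);
  zfp_field* field = zfp_field_alloc();
  size_t size = 0;

  /* the header restores scalar type, dimensions, and compression mode... */
  if (zfp_read_header(zfp, field, ZFP_HEADER_FULL)) {
    /* ...but the data pointer and strides are up to the consumer */
    zfp_field_set_pointer(field, dst);
    zfp_field_set_stride_2d(field, sx, sy);
    size = zfp_decompress(zfp, field);
  }

  zfp_field_free(field);
  zfp_stream_close(zfp);
  stream_close(stream);
  return size;
}
```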

S-o-T commented 2 months ago

Thanks for the detailed answer. The scenario I am interested in is compressing a 2D vector field as two strided 2D scalar fields. I believe that in this scenario I must either manually (de)interleave the vector components into separate planes before compression and after decompression, or rely on zfp's internal accounting for strides. As you suggested, storing the strides externally and providing them during decompression is a workable solution, although I would argue that deriving all the metadata required to set up decompression directly from the header would be more convenient.
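For reference, here is roughly what the compression side looks like in my scenario (a sketch against the zfp 1.0-style C API, untested; fixed-rate mode and the 8 bits/value are just placeholders):

```c
#include <stddef.h>
#include <stdlib.h>
#include "zfp.h"

/* Sketch (untested): compress component c (0 or 1) of an interleaved
 * 2D vector field, float data[ny][nx][2], as a strided scalar field. */
static size_t
compress_component(const float* data, size_t nx, size_t ny, int c,
                   void** out) /* receives malloc'd compressed buffer */
{
  /* point at the first value of component c; stride over the other one */
  zfp_field* field = zfp_field_2d((void*)(data + c), zfp_type_float, nx, ny);
  zfp_field_set_stride_2d(field, 2, (ptrdiff_t)(2 * nx));

  zfp_stream* zfp = zfp_stream_open(NULL);
  zfp_stream_set_rate(zfp, 8.0, zfp_type_float, 2, zfp_false); /* example rate */

  size_t bufsize = zfp_stream_maximum_size(zfp, field);
  *out = malloc(bufsize);
  bitstream* stream = stream_open(*out, bufsize);
  zfp_stream_set_bit_stream(zfp, stream);
  zfp_stream_rewind(zfp);

  /* the header records type, dimensions, and mode -- but not the strides */
  zfp_write_header(zfp, field, ZFP_HEADER_FULL);
  size_t size = zfp_compress(zfp, field);

  zfp_field_free(field);
  zfp_stream_close(zfp);
  stream_close(stream);
  return size;
}
```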

A bit tangential: what is the rationale for not storing all fields of zfp_field except the data pointer directly in the header? It seems that the cost of such a header would be negligible compared to the compressed stream itself.

lindstro commented 2 months ago

As mentioned above, one rationale is that the consumer may not want to organize the data the same way the producer does. In fact, I cannot think of a case where the consumer, which processes the data, does not know how it wants the data to be organized. Can you think of a scenario where it would be beneficial to have the producer dictate the data layout for the consumer? In the case of a code processing 2D vector fields, the consumer needs to know if the data layout is float field[ny][nx][2] or float field[2][ny][nx] (or some other permutation) so it can index the multidimensional field properly. If the code is written with one of these conventions, it will fail if the data producer mandates the other convention. While one can write such a code using strides (e.g. field[stride_x * x + stride_y * y + stride_c * c] to access vector component c at (x, y)), oftentimes you want to use some container class like a NumPy array whose strides are given by the container, not the data producer.
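Concretely, here are the stride triples the two conventions imply (my notation, not part of the zfp API):

```c
/* Accessing vector component c at (x, y) as field[sx*x + sy*y + sc*c]:
 *
 *   float field[ny][nx][2]  ->  sc = 1,       sx = 2,  sy = 2 * nx
 *   float field[2][ny][nx]  ->  sc = nx * ny, sx = 1,  sy = nx
 *
 * Same logical data, incompatible strides: code written against one
 * layout cannot simply be handed data in the other. */
```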

Another rationale is that we have gone to great lengths to make the storage of metadata and compression parameters as compact as possible; in most cases, we encode array dimensions, scalar type, and compression mode and parameters in only 64 bits. This compact encoding is motivated by zfp's unique approach to representing large arrays as a collection of very small blocks (consisting of 4^d values in d dimensions) that can be (de)compressed independently. We early on anticipated the potential to vary compression parameters spatially, perhaps even from one block to the next, and in that case the overhead of storing compression parameters becomes large. Similarly, in certain applications (like AMR), one may form a larger grid as a collection of smaller ones, with each subgrid composed of a small collection of zfp blocks. In this case, it is again important to keep array metadata per subgrid small. One may even vary precision spatially (e.g., float vs. double), where again you need an efficient way of encoding scalar type. Whereas individual array dimensions are often small (say, 16 bits or less), strides are not only signed but may span the product of all dimensions or even more (when multiple fields are interleaved), making them far costlier to encode. In practice, you often need more than 32 bits per stride, or more than 96 bits for the 2D vector field example above.
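To put numbers on the stride cost (a hypothetical example, not taken from the codec):

```c
/* Hypothetical 65536 x 65536 vector field in planar layout f[2][ny][nx]:
 *
 *   sx = 1
 *   sy = 65536           (2^16)
 *   sc = 65536 * 65536   (2^32 -- does not fit in a signed 32-bit stride)
 *
 * versus the 64 bits the current header spends on type, dimensions,
 * and compression mode combined. */
```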

Now, I can envision a case where the consumer (perhaps an I/O module) is tasked only with reconstructing the original data bit for bit. Using the current zfp API, it would be possible to add a new ZFP_HEADER tag for strides to also store this information. The consumer could then override the strides set in zfp_read_header() before calling zfp_decompress(). The main challenge would be to do this in a backwards compatible manner as one would presumably have to redefine ZFP_HEADER_FULL to also include strides, and that would break existing code. But it may be reasonable to consider such a feature for future versions of the zfp codec. There are other changes to the compressed format we would want to incorporate, but a change to the codec will not happen anytime soon.

S-o-T commented 2 months ago

> We early on anticipated the potential to vary compression parameters spatially, perhaps even from one block to the next, and in that case the overhead of storing compression parameters becomes large.

The need to design for such a use case answers my question, thanks.

> Can you think of a scenario where it would be beneficial to have the producer dictate the data layout for the consumer?

> Now, I can envision a case where the consumer (perhaps an I/O module) is tasked only with reconstructing the original data bit for bit.

This is pretty much the case for my usage scenario.