alpaka-group / llama

A Low-Level Abstraction of Memory Access
https://llama-doc.rtfd.io/
Mozilla Public License 2.0

Compressed blobs #294

Open bernhardmgruber opened 3 years ago

bernhardmgruber commented 3 years ago

Motivated by use cases in databases, LLAMA blobs could utilize lightweight compression algorithms like PFOR (Patched Frame of Reference). While the reduction of blob sizes is an interesting side effect, the main motivation is reducing memory bandwidth when data is read through the memory hierarchy, since decompression should be very local and very close to the compute units.

This could be implemented based on computed fields #170, but probably does not play well with random access.
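
For illustration, here is a minimal frame-of-reference (FOR) sketch in plain C++; all names are made up and this is not the LLAMA API. PFOR extends this by storing outliers that do not fit the narrow delta type as separate "patch" exceptions, which is one reason random access into such a blob gets awkward.

```cpp
// Minimal frame-of-reference (FOR) sketch (made-up names, not the LLAMA API).
// A block stores one reference value plus narrow per-element deltas. PFOR
// additionally patches outliers that do not fit the narrow delta type.
#include <algorithm>
#include <cstdint>
#include <vector>

struct ForBlock {
    std::int64_t reference;            // minimum value of the block
    std::vector<std::uint16_t> deltas; // value - reference, assumed to fit 16 bits
};

// values must be non-empty and have a range that fits 16 bits
auto forEncode(const std::vector<std::int64_t>& values) -> ForBlock {
    ForBlock b{*std::min_element(values.begin(), values.end()), {}};
    b.deltas.reserve(values.size());
    for (const auto v : values)
        b.deltas.push_back(static_cast<std::uint16_t>(v - b.reference));
    return b;
}

auto forDecode(const ForBlock& b) -> std::vector<std::int64_t> {
    std::vector<std::int64_t> values;
    values.reserve(b.deltas.size());
    for (const auto d : b.deltas)
        values.push_back(b.reference + d);
    return values;
}
```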

bussmann commented 3 years ago

There is an interesting paper on compression and throughput by @ax3l.

ax3l commented 3 years ago

Thanks for the ping. I think trying various lightweight delta-based methods could be a nice use case for the LLAMA data abstraction. https://docs.actian.com/vector/5.0/index.html#page/User/Data_Type_Storage_Format_and_Compression_Type.htm
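
As a minimal sketch of what "delta-based" means here (plain C++, made up for illustration, independent of LLAMA): keep the first value as an anchor and store only the differences between neighbours, which stay small for sorted or slowly varying columns.

```cpp
// Minimal delta-encoding sketch: element 0 stays as an anchor, every later
// element becomes the difference to its predecessor. Decoding is a prefix sum.
#include <cstddef>
#include <cstdint>
#include <vector>

auto deltaEncode(std::vector<std::int64_t> values) -> std::vector<std::int64_t> {
    for (std::size_t i = values.size(); i-- > 1;) // backwards, so each step
        values[i] -= values[i - 1];               // still sees the raw predecessor
    return values;
}

auto deltaDecode(std::vector<std::int64_t> values) -> std::vector<std::int64_t> {
    for (std::size_t i = 1; i < values.size(); ++i)
        values[i] += values[i - 1];
    return values;
}
```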

You could also experiment with (small) block-wise compression on device with various experimental compressors.

Whatever you choose, you can estimate whether it is worth doing (i.e. whether the compute overhead will not outrun your transfer savings in terms of time spent) with a performance model that we published in DOI:10.1007/978-3-319-67630-2_2 (arXiv:1706.00522). I also mention a few variations of the model in the slides, e.g. zero-copy or zero-prep time. And even if it's not worth it walltime-wise but the overhead stays reasonable, you could still find cases where it pays off, because you can fit larger sims into constrained resources (HPC systems still have very limited on-device HBM for large 3D sims).
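
This is not the model from the paper, just a back-of-the-envelope version of the same question with assumed numbers: compression pays off when compressing, moving the smaller payload, and decompressing together beat the plain transfer.

```cpp
// Back-of-the-envelope estimate (all numbers assumed, not the published model):
// does compress + transfer-compressed + decompress beat a plain transfer?
#include <cstdio>

int main() {
    const double bytes = 1e9;             // payload size in bytes (assumed)
    const double bandwidth = 12e9;        // transfer bandwidth in B/s (assumed)
    const double ratio = 2.5;             // compression ratio (assumed)
    const double compThroughput = 50e9;   // compressor throughput in B/s (assumed)
    const double decompThroughput = 80e9; // decompressor throughput in B/s (assumed)

    const double plain = bytes / bandwidth;
    const double withCompression = bytes / compThroughput // compress
        + (bytes / ratio) / bandwidth                     // move the smaller payload
        + bytes / decompThroughput;                       // decompress
    std::printf("plain: %.3fs, compressed: %.3fs -> %s\n", plain, withCompression,
        withCompression < plain ? "worth it" : "not worth it");
}
```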

bernhardmgruber commented 3 years ago

Thanks for the link to the paper! I have skimmed through it and I think it is useful for guiding some choices on when to use compression and which technology to use. For the time being, that choice rests with the user. Once LLAMA gets an automated mapping chooser, we could employ such a model to make informed decisions based on target hardware and access pattern. So I will revisit your model later :)

For compression in general, I want to provide the necessary infrastructure in LLAMA, independently of whether that is useful in specific scenarios. The core problem with compression is that you easily lose efficient random access to a LLAMA data structure. Efficient random access is also stateless with respect to the mapping implementation. Using some form of block-wise compression needs to restrict the access pattern in some way, e.g. to linear forward/backward traversal. A LLAMA mapping could then become stateful and store/cache decompressed blocks. So in principle it is doable, but it changes the current lightweight nature of mappings and propagates specific access patterns to the accessing program. And I have not yet had enough time to figure out how that interplays.
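
A hypothetical sketch of that idea (invented names, not the LLAMA mapping interface): the mapping keeps a mutable cache of the last decompressed block, so a linear traversal decompresses each block exactly once, while random access pays a refill per jump.

```cpp
// Hypothetical stateful block cache (invented names, not the LLAMA mapping
// interface). Toy "compression": per-block anchor + deltas, with deltas sized
// numBlocks * blockSize by the caller.
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t blockSize = 1024;

struct CompressedBlob {
    std::vector<std::int64_t> anchors; // first value of each block
    std::vector<std::int16_t> deltas;  // deltas[i] leads from element i to i + 1

    auto load(std::size_t i) const -> std::int64_t {
        const std::size_t block = i / blockSize;
        if (block != cachedBlock) { // the mapping is now stateful
            std::int64_t acc = anchors[block];
            for (std::size_t j = 0; j < blockSize; ++j) {
                cache[j] = acc;
                acc += deltas[block * blockSize + j];
            }
            cachedBlock = block;
        }
        return cache[i % blockSize];
    }

    mutable std::size_t cachedBlock = static_cast<std::size_t>(-1);
    mutable std::array<std::int64_t, blockSize> cache{};
};
```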

bernhardmgruber commented 2 years ago

The bitpacking mappings partially address this issue: #420.

bussmann commented 2 years ago

Please involve Fabian Koller in this discussion. He should know and can help!

psychocoderHPC commented 2 years ago

> Please involve Fabian Koller in this discussion. He should know and can help!

Why do you mention Fabian Koller (now with LogMeIn)? He wrote the first openPMD implementation; what do you have in mind where he could contribute?

bernhardmgruber commented 2 years ago

AFAIK, he is working on a state-of-the-art analysis of (lossy) compression for his diploma thesis. We had a VC together where we discussed the potential use of LLAMA and a HEP analysis with ROOT (CERN) for an applied part of his work. We agreed that he will come back to us after figuring out the scope of his work and whether LLAMA or ROOT fits that scope.

bussmann commented 2 years ago

I opened a Mattermost channel for you under CASUS to discuss these things. Fabian is eager to help implement this, and his C++ kung fu is great.

bernhardmgruber commented 2 years ago

#427 merged two new mappings that can bitpack integer and FP values.
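
Not the API of the merged mappings, but a minimal sketch of the underlying idea for the integer case (made-up names): keep only `bits` bits per value inside a packed word array; the FP variant, as I understand it, additionally shrinks exponent and mantissa widths.

```cpp
// Minimal integer bitpacking sketch (made-up names, not the API merged in
// #427): each value occupies exactly `bits` bits inside a packed word array.
// Bit-by-bit loops for clarity, not speed.
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitPackedInts {
    unsigned bits;                    // bits kept per value, e.g. 11
    std::vector<std::uint64_t> words; // packed storage, sized by the caller

    void store(std::size_t i, std::uint64_t value) {
        for (unsigned b = 0; b < bits; ++b) {
            const std::size_t pos = i * bits + b; // absolute bit position
            const std::uint64_t bit = (value >> b) & 1u;
            words[pos / 64] = (words[pos / 64] & ~(std::uint64_t{1} << (pos % 64)))
                            | (bit << (pos % 64));
        }
    }

    auto load(std::size_t i) const -> std::uint64_t {
        std::uint64_t value = 0;
        for (unsigned b = 0; b < bits; ++b) {
            const std::size_t pos = i * bits + b;
            value |= ((words[pos / 64] >> (pos % 64)) & std::uint64_t{1}) << b;
        }
        return value;
    }
};
```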