38 / d4-format

The D4 Quantitative Data Format

d4toh5.py used for manuscript? #40

Closed: ivirshup closed this issue 2 years ago

ivirshup commented 2 years ago

Hi!

I'm trying to get a better understanding of this method's/format's performance benchmarks, and had a question about the HDF5 results in the manuscript. Is the d4toh5.py script (called here) available somewhere?

I saw there is a script at d4-format/pyd4/examples/d4toh5.py, but I suspect a different one was used for the manuscript, since this script appears to write an uncompressed HDF5 file.
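For context, here's a rough sketch of what I assume such a conversion looks like. The pyd4-side loading is a hypothetical placeholder (not the real pyd4 API), the chromosome list is made up, and the commented-out line is the kind of compression settings I would have expected if the benchmark file were compressed:

```python
# Rough sketch of a D4 -> HDF5 conversion, for discussion only.
# `load_depths(chrom)` is a hypothetical stand-in for however the real
# script pulls per-chromosome depth values into a numpy array via pyd4.
import h5py
import numpy as np

def load_depths(chrom):
    # Placeholder: the real data would come from the D4 file.
    return np.zeros(1_000, dtype=np.uint32)

chroms = ["chr1", "chr2"]  # placeholder chromosome list

with h5py.File("depths.h5", "w") as f:
    for chrom in chroms:
        depths = load_depths(chrom)
        # Uncompressed, contiguous dataset: what the bundled example
        # appears to produce.
        f.create_dataset(chrom, data=depths)
        # A "highly compressed" variant would instead look like:
        # f.create_dataset(chrom, data=depths,
        #                  compression="gzip", compression_opts=9)
```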

Thanks!

arq5x commented 2 years ago

@38 could you please look into this?

38 commented 2 years ago

Hi @ivirshup , thanks for trying D4.

Actually, it is intentional that we make the HDF5 file uncompressed, for the following reasons. D4 isn't designed to be the most space-efficient format; as our manuscript states, it's a balanced solution offering both a reasonable file size and high sequential and random access speed.

  1. If you look at our file size comparison, you may notice that the D4 file is actually larger than the uncompressed HDF5 file. You could expect a compressed HDF5 file to be even more space efficient, but that is not what D4 is designed for.
  2. When we compare performance, we want to measure each format at its best access speed. For HDF5, it would not be a fair comparison to measure a compressed HDF5 file against a D4 file.

So that's why we intentionally keep the HDF5 file uncompressed: we focus on speed, because D4 is designed for hot data.
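To illustrate the kind of gap I mean, here is a minimal sketch (illustrative only, not the benchmark code used for the manuscript) that times random-access reads from an uncompressed versus a gzip-compressed HDF5 dataset with h5py:

```python
# Illustrative only: compare random-access read time from an
# uncompressed vs. a gzip-compressed HDF5 dataset.
import time
import h5py
import numpy as np

rng = np.random.default_rng(0)
depths = rng.poisson(30, size=10_000_000).astype(np.uint32)

with h5py.File("bench.h5", "w") as f:
    f.create_dataset("raw", data=depths)  # contiguous, uncompressed
    f.create_dataset("gzip", data=depths, compression="gzip", compression_opts=6)

def time_random_reads(name, n=1_000, width=10_000):
    """Time n random slice reads of `width` values from dataset `name`."""
    with h5py.File("bench.h5", "r") as f:
        dset = f[name]
        starts = rng.integers(0, len(depths) - width, size=n)
        t0 = time.perf_counter()
        for s in starts:
            _ = dset[s:s + width]
        return time.perf_counter() - t0

print("uncompressed:", time_random_reads("raw"))
print("gzip:        ", time_random_reads("gzip"))
```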

Please let me know if you have any questions.

Thanks!

ivirshup commented 2 years ago

Thanks for getting back to me!

I might be a little confused about the format of the HDF5 file used here. My impression that the HDF5 file used in the benchmarks is compressed came from these parts of the paper:

In this way, we estimate that chromosome 1 of the human genome, which contains 249 million bases, will consume 212 megabytes for a D4 encoding of WGS data, as (249,000,000 × 6 bits + (0.01 × 249,000,000 × 80 bits)) < 181 MiB. In contrast, if we were to store each depth as an unsigned 32 bit integer, 996 megabytes would be required for the same data.

A memory-mapped array of larger unsigned integers is essentially what I expected an uncompressed HDF5 file to be (plus chunking and metadata).
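Just to check my reading of that estimate, here is the back-of-envelope arithmetic from the quoted formula (my own calculation, not from the paper):

```python
# Back-of-envelope from the quoted formula (my arithmetic).
bases = 249_000_000                  # chromosome 1
primary_bits = bases * 6             # 6-bit primary table
secondary_bits = 0.01 * bases * 80   # ~1% of positions spill to the secondary table

total_bytes = (primary_bits + secondary_bits) / 8
print(f"primary table:  {primary_bits / 8 / 1e6:.0f} MB")
print(f"total estimate: {total_bytes / 1e6:.0f} MB")   # ~212 MB
print(f"uint32 array:   {bases * 4 / 1e6:.0f} MB")     # ~996 MB
```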

And:

For both WGS and RNA-seq datasets, D4 yielded a 10-fold faster file creation time, and, with the exception of the highly compressed HDF5 format, yielded the smallest file size.

I'd also note that I've seen writing repetitive small integers to HDF5 with gzip compression turn into a terrible performance edge case. The "off the scale" write times here reminded me of that.
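For what it's worth, this is the kind of check I'd run to see whether that edge case is in play (my own sketch, nothing to do with the manuscript's benchmark scripts):

```python
# Quick check (my own): time writing highly repetitive small integers
# to HDF5 with and without gzip compression.
import time
import h5py
import numpy as np

# Repetitive, low-cardinality values, roughly like WGS depth data.
depths = np.repeat(np.arange(60, dtype=np.uint32), 200_000)

for kwargs in ({}, {"compression": "gzip", "compression_opts": 6}):
    t0 = time.perf_counter()
    with h5py.File("write_test.h5", "w") as f:
        f.create_dataset("depth", data=depths, **kwargs)
    print(kwargs or "uncompressed", round(time.perf_counter() - t0, 3), "s")
```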


What I'm really trying to figure out here is whether one could use an equivalent storage pattern inside some array store (e.g. fixed small-bit-width arrays, plus a secondary encoding table) and get similar performance, or if I'm missing something else about this format.
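Concretely, the storage pattern I have in mind looks something like this (a sketch of my mental model only, not the actual D4 encoding; a real implementation would bit-pack the primary array rather than store uint8):

```python
# Sketch of my mental model (not the actual D4 encoding): a fixed
# small-bit-width primary array plus a secondary table for outliers.
import numpy as np

BITS = 6
SENTINEL = (1 << BITS) - 1  # value meaning "look in the secondary table"

def encode(depths):
    # Primary array: values clipped to the sentinel; uint8 for simplicity.
    primary = np.minimum(depths, SENTINEL).astype(np.uint8)
    # Secondary table: positions whose true value doesn't fit in BITS bits.
    outliers = np.flatnonzero(depths >= SENTINEL)
    secondary = {int(i): int(depths[i]) for i in outliers}
    return primary, secondary

def decode_at(primary, secondary, i):
    v = primary[i]
    return secondary[i] if v == SENTINEL else int(v)

depths = np.array([10, 30, 63, 64, 1000, 12], dtype=np.uint32)
primary, secondary = encode(depths)
assert [decode_at(primary, secondary, i) for i in range(len(depths))] == \
       [10, 30, 63, 64, 1000, 12]
```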

I think I may just need to figure out how to compile this and check it out.