alugowski / sparse-matrix-io-comparison

Compare I/O of sparse matrix libraries
BSD 2-Clause "Simplified" License

gzip? #1

Open arogozhnikov opened 11 months ago

arogozhnikov commented 11 months ago

Hi, this is a random person who got this repo in github recommendations.

Comment1: cool and nice, but ungzipped Matrix Market is a very expensive option. A comparison with gzip is more valuable IMO.

Comment2: and as a baseline, it would be valuable to see something terribly simple based on a columnar format, e.g.:

import polars as pl
from scipy import sparse
import numpy as np

x = sparse.coo_array(np.random.randn(1000, 1000)) # not sparse, I know

# save to parquet
pl.DataFrame(dict(col=x.col, row=x.row, data=x.data)).write_parquet('/tmp/sparse_coo.pqt')

df = pl.read_parquet('/tmp/sparse_coo.pqt')
x2 = sparse.coo_array((df['data'].to_numpy(), (df['row'].to_numpy(), df['col'].to_numpy())))

np.allclose(x.todense(), x2.todense())

and see how these compare in terms of speed (because from the perspective of floating-point precision, binary formats certainly win)

alugowski commented 11 months ago

Thanks for the feedback!

Comment1: cool and nice, but ungzipped Matrix Market is a very expensive option. A comparison with gzip is more valuable IMO.

You're right that Matrix Market files are both large and compress well.

Curious, what's your workflow that uses gzipped files? I'm only aware of one popular sparse matrix library that can load gzipped files directly, SciPy, though FMM can be easily configured to do so (and the Python bindings do that already). All the others accept a plain .mtx file and will choke on anything not uncompressed.
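
For reference, the SciPy path is roughly the following (mmread transparently gunzips filenames ending in .gz; the file name here is just an example):

from scipy.io import mmread

# scipy.io.mmread gunzips .gz filenames transparently,
# so a compressed Matrix Market file loads without extra steps
mat = mmread('matrix.mtx.gz')  # example path; returns a COO matrix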

It's probably worthwhile to add some compression benchmarks as well. I'll include gzip, but mainly to show how slow it is and what a bottleneck it becomes. Something like lz4 or zstd would fit this application better.
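
As a rough sketch of the kind of comparison I have in mind (the third-party zstandard package and the file path are assumptions here, and the results obviously depend on the data):

import gzip
import time

import zstandard  # third-party package: pip install zstandard

# read an existing uncompressed Matrix Market file; the path is an example
with open('matrix.mtx', 'rb') as f:
    data = f.read()

for label, compress in [
    ('gzip', lambda d: gzip.compress(d, compresslevel=6)),
    ('zstd', lambda d: zstandard.ZstdCompressor(level=3).compress(d)),
]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f'{label}: {len(out) / 1e6:.0f} MB in {elapsed:.2f} s')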

Comment2

Good point, I'll add something like that when I get a bit of time. I'm curious how it would go myself.

arogozhnikov commented 11 months ago

Single-cell-whatever, like single-cell RNA-seq, single-cell ATAC, and other omics assays are (at some level) basically huge matrices of cell × gene. AFAIR they originally used compressed Matrix Market, but that's an undesirable option.

These days they use their own formats to incorporate metadata associated with rows and cols, usually on top of hdf5 or similar format. And they usually just keep row/col/data as I've done in the python excerpt above.

https://anndata.readthedocs.io/en/latest/ and h5seurat are (I think) the most widely adopted, but 1) they clearly target the application and 2) they additionally keep various analysis artifacts, which makes the format more complicated. OTOH the anndata package is quite user-friendly.

arogozhnikov commented 11 months ago

compressed mtx seems still supported: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices

alugowski commented 11 months ago

compressed mtx seems still supported: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices

Ok cool. So the Python side uses SciPy's loader:

mat = scipy.io.mmread(os.path.join(matrix_dir, "matrix.mtx.gz"))

FYI, the SciPy loader is as slow as anything on this list (for now). I wrote FMM partly to speed it up, and it does; you can expect a 20x speedup on your laptop: https://github.com/alugowski/fast_matrix_market/tree/main/python

It'll handle gzipped files too, but I haven't benchmarked the effect on speed. I expect the decompressor to be the bottleneck (for FMM, not for Python scipy.io). Probably still worth it for the disk usage savings, though a native C++ decompressor might be worth looking at.
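
Usage mirrors scipy.io, roughly:

import fast_matrix_market as fmm

# drop-in replacements for scipy.io.mmread / scipy.io.mmwrite
mat = fmm.mmread('matrix.mtx')    # plain Matrix Market file
gz = fmm.mmread('matrix.mtx.gz')  # the Python bindings also accept gzipped files
fmm.mmwrite('copy.mtx', mat)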

I do say "for now", because we're close to getting FMM merged into scipy, so in the future that example code should magically get faster for free for you.

Based on that example it looks like R's readMM can handle gzipped files too. That's good to know.

arogozhnikov commented 11 months ago

so in the future that example code should magically get faster for free for you.

cool.

But just in case: I don't use mtx/mtx.gz and I think very few do these days.

I appreciate your efforts a lot, because scipy's mmread is indeed fantastically slow, but I think we'd be better off promoting some binary format as an alternative. Personally I'd much prefer a trivial parquet-based / hdf5-based format to mtx/mtx.gz 😄

alugowski commented 11 months ago

Parquet:

python bench_polars.py

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
op:read/impl:Polars/format:Parquet/0/iterations:1/real_time       0.272 s         0.003 s             1 1024MiB.mtx=0 MM_equivalent_bytes_per_second=3.94835G/s bytes_per_second=1.97559G/s
op:read/impl:Polars/format:Parquet/1/iterations:1/real_time       0.208 s         0.002 s             1 1024MiB.sorted.mtx=1 MM_equivalent_bytes_per_second=5.16654G/s bytes_per_second=1.99667G/s
op:write/impl:Polars/format:Parquet/0/iterations:1/real_time       2.75 s          2.73 s             1 1024MiB.mtx=0 MM_equivalent_bytes_per_second=390.834M/s bytes_per_second=200.25M/s
op:write/impl:Polars/format:Parquet/1/iterations:1/real_time       2.70 s          2.69 s             1 1024MiB.sorted.mtx=1 MM_equivalent_bytes_per_second=398.328M/s bytes_per_second=157.633M/s

10GiB file (note machine has 16GiB RAM):

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------
op:read/impl:Polars/format:Parquet/0/iterations:1/real_time        27.7 s         0.053 s             1 10240MiB.mtx=0 MM_equivalent_bytes_per_second=387.431M/s bytes_per_second=198.52M/s
op:write/impl:Polars/format:Parquet/0/iterations:1/real_time       37.3 s          30.0 s             1 10240MiB.mtx=0 MM_equivalent_bytes_per_second=287.603M/s bytes_per_second=147.368M/s

I guess it's ok. A bit faster reads, a bit slower writes. File size is smaller than MM, but larger than compressed MM. Bad performance when the matrix is a large fraction of memory, but that is common, especially in Python land.

Honestly I'm disappointed in all the binary formats I've tried. They should be maxing out the IO, not going head-to-head with a human-readable archival format.

Again this is on a 6-core laptop. FMM can use more cores.

alugowski commented 11 months ago

What does your hdf5 workflow look like? What sorts of performance do you see?

arogozhnikov commented 11 months ago

FMM can use more cores.

Not sure I'd love using many cores for IO, but tastes differ :)

op:read/impl:Polars/format:Parquet/0/iterations:1/real_time 27.7 s 0.053 s

0.053 sec? Strange, likely just IO bound. Do you have a good SSD?

I ran my benchmark with a dense-pretending-to-be-sparse matrix and a 1GB file:

reading:

CPU times: user 2.35 s, sys: 778 ms, total: 3.13 s Wall time: 2.43 s

construction of scipy matrix

CPU times: user 167 ms, sys: 153 ms, total: 319 ms Wall time: 347 ms

That's mac m1, but most systems with reasonable SSD should be in the same range.

Parquet has settings for better compression, just as gzip.
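
E.g., a quick sketch on the same toy matrix as above (the exact speed/size trade-off would need measuring):

import numpy as np
import polars as pl
from scipy import sparse

x = sparse.coo_array(np.random.randn(1000, 1000))  # same toy matrix as above
df = pl.DataFrame(dict(col=x.col, row=x.row, data=x.data))

# trade write speed against file size via the Parquet codec and level
df.write_parquet('/tmp/sparse_coo_zstd.pqt', compression='zstd', compression_level=9)
df.write_parquet('/tmp/sparse_coo_lz4.pqt', compression='lz4')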

arogozhnikov commented 11 months ago

For hdf5, the pipeline is the same: you use h5py to store row/col/data.

Or use https://anndata.readthedocs.io/en/latest/api.html with read_h5ad and write_h5ad, which do this and save additional metadata.
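
A minimal sketch of that h5py pipeline (dataset names are arbitrary; compression is optional):

import h5py
import numpy as np
from scipy import sparse

x = sparse.coo_array(np.random.randn(1000, 1000))  # toy matrix again

# write row / col / data as three datasets, plus the shape as an attribute
with h5py.File('/tmp/sparse_coo.h5', 'w') as f:
    f.create_dataset('row', data=x.row, compression='gzip')
    f.create_dataset('col', data=x.col, compression='gzip')
    f.create_dataset('data', data=x.data, compression='gzip')
    f.attrs['shape'] = x.shape

# read back and rebuild the COO matrix
with h5py.File('/tmp/sparse_coo.h5', 'r') as f:
    x2 = sparse.coo_array((f['data'][:], (f['row'][:], f['col'][:])),
                          shape=tuple(f.attrs['shape']))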

HDF5 vs. parquet: I think these days parquet is more widespread than hdf5, and tooling like fastparquet is usually better. Also, parquet got traction from big tech, and it is more of a standard, whereas hdf5 is more of a 'standard implementation'.

alugowski commented 11 months ago

FMM can use more cores.

Not sure I'd love using many cores for IO, but tastes differ :)

Of course inefficient I/O isn't good, and matrix market certainly isn't optimal for speed.

But if those cores are idle and the operation is compute bound, which ALL of these appear to be, Parquet included, then why not use the idle cores to finish faster?

op:read/impl:Polars/format:Parquet/0/iterations:1/real_time 27.7 s 0.053 s

0.053 sec? Strange, likely just IO bound. Do you have a good SSD?

27.7 s is the wall-clock time.

Notice it's a 10GB file on a machine with 16GB RAM. The point of that test is to see how well the I/O library handles matrices that take up more than half of RAM. Often they like to duplicate the matrix, either explicitly somehow or implicitly by a cache.

This is one example of bad performance: the 10GB file is 10x larger than the 1GB one but takes 100x as long to read.

I ran my benchmark with a dense-pretending-to-be-sparse matrix and a 1GB file:

reading:

CPU times: user 2.35 s, sys: 778 ms, total: 3.13 s Wall time: 2.43 s

So 10x slower than what my benchmark showed? I saw 0.272s for a sparse 1GB file with Parquet.

Just export your file to .mtx and run python bench_polars.py to run my code against it. It'll also compute an effective bytes/second figure which, while imperfect in its own ways, at least gives some context on whether a number is good or bad.
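
Roughly (a sketch; scipy.io.mmwrite works, or fast_matrix_market's mmwrite if you want the export itself to be fast):

import numpy as np
from scipy import io, sparse

x = sparse.coo_array(np.random.randn(1000, 1000))  # stand-in for your matrix
io.mmwrite('/tmp/export.mtx', x)  # recent SciPy accepts sparse arrays here
# then: python bench_polars.py, as described above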

construction of scipy matrix

CPU times: user 167 ms, sys: 153 ms, total: 319 ms Wall time: 347 ms

That's mac m1, but most systems with reasonable SSD should be in the same range.

Parquet has settings for better compression, just as gzip.

I also have an M1 mac.

arogozhnikov commented 11 months ago

I was pointing at the unusually low CPU usage; now I see the reason.

This is one example of bad performance.

That benchmark makes no practical sense.

If you don't have twice the memory, you can't operate on it in Python (as practically anything creates a copy). You could appreciate that a trivial and efficient format was assembled in a couple of lines, and I put zero effort into optimizing it.

And yes, there is a conversion from polars to numpy that creates a copy. If you put in some effort (namely, reading col/row/data separately) you could alleviate the "problem" without performance downsides. That's three more lines, but why solve a non-issue?
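
Something along these lines (whether the final to_numpy() is zero-copy depends on the polars version and dtype):

import polars as pl
from scipy import sparse

path = '/tmp/sparse_coo.pqt'

# read each column separately so the full three-column DataFrame
# never has to coexist with all three numpy copies
row = pl.read_parquet(path, columns=['row'])['row'].to_numpy()
col = pl.read_parquet(path, columns=['col'])['col'].to_numpy()
data = pl.read_parquet(path, columns=['data'])['data'].to_numpy()

x2 = sparse.coo_array((data, (row, col)))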

Just export your file to .mtx

I don't have that much space on my drive; mtx is 6 times larger for the smaller sizes. Compressing would make this comparison fair and meaningful, but it would also reduce any margins. That was my original point.

For the sake of science, side by side, but at a smaller size:

saving/loading/converting to scipy

saving:    Wall time: 992 ms   CPU times: user 686 ms, sys: 401 ms, total: 1.09 s
loading:   Wall time: 601 ms   CPU times: user 925 ms, sys: 278 ms, total: 1.2 s
to scipy:  Wall time: 60 ms    CPU times: user 43.4 ms, sys: 16.7 ms, total: 60.1 ms

saving/loading/converting to scipy with snappy compression, level=1

saving:    Wall time: 564 ms   CPU times: user 322 ms, sys: 326 ms, total: 647 ms
loading:   Wall time: 247 ms   CPU times: user 530 ms, sys: 113 ms, total: 643 ms
to scipy:  Wall time: 295 ms   CPU times: user 161 ms, sys: 43.1 ms, total: 204 ms

fmm (in python, reading to scipy, includes conversion)

CPU times: user 7.62 s, sys: 767 ms, total: 8.39 s Wall time: 1.34 s

file sizes:

2.3G  Jul 16 18:56  /tmp/sparse_coo.mtx
476M  Jul 16 18:55  /tmp/sparse_coo.pqt  # snappy
390M  Jul 16 18:59  /tmp/sparse_coo.pqt  # default

I don't doubt you have the fastest mtx reader in the west; I doubt mtx is the way forward. The community is finally (after so many years) converting from CSV to Parquet, and something similar should happen with other formats: they should be replaced with scalable ones. You know, bills for AWS storage convert people better than arguments do 😄

There are a ton of issues around using sparse matrices in Python if you're curious about some practical problems:

alugowski commented 11 months ago

I have to disagree that large matrix benchmarks make no sense. Maybe your use case doesn't care, but there are many that do.

Naturally 10GB is an extreme example, but that's on purpose. The slowdowns begin around 4GB. Here are reads on files going from 1GB to 8GB:

op:read/impl:Polars/format:Parquet/0/iterations:2/real_time       0.242 s         0.007 s             2 1024MiB.mtx=0 MM_equivalent_bytes_per_second=4.42946G/s bytes_per_second=2.21662G/s
op:read/impl:Polars/format:Parquet/1/iterations:2/real_time       0.551 s         0.013 s             2 2048MiB.mtx=1 MM_equivalent_bytes_per_second=3.89672G/s bytes_per_second=1.94987G/s
op:read/impl:Polars/format:Parquet/2/iterations:2/real_time       0.863 s         0.020 s             2 3072MiB.mtx=2 MM_equivalent_bytes_per_second=3.73373G/s bytes_per_second=1.86828G/s
op:read/impl:Polars/format:Parquet/3/iterations:2/real_time        2.89 s         0.121 s             2 4096MiB.mtx=3 MM_equivalent_bytes_per_second=1.487G/s bytes_per_second=761.878M/s
op:read/impl:Polars/format:Parquet/4/iterations:2/real_time        5.36 s         0.111 s             2 5120MiB.mtx=4 MM_equivalent_bytes_per_second=1002.28M/s bytes_per_second=513.573M/s
op:read/impl:Polars/format:Parquet/5/iterations:2/real_time        7.59 s         0.131 s             2 6144MiB.mtx=5 MM_equivalent_bytes_per_second=848.599M/s bytes_per_second=434.814M/s
op:read/impl:Polars/format:Parquet/6/iterations:2/real_time        10.1 s         0.105 s             2 7168MiB.mtx=6 MM_equivalent_bytes_per_second=747.135M/s bytes_per_second=382.846M/s
op:read/impl:Polars/format:Parquet/7/iterations:2/real_time        17.6 s         0.086 s             2 8192MiB.mtx=7 MM_equivalent_bytes_per_second=488.149M/s bytes_per_second=250.128M/s

Notice the read speed goes from an effective 2GB/s to 0.25GB/s. That's a big drop. Also the writes are at 200MB/s at all file sizes.

A binary format should be able to do much better than that.

Maybe you're ok with writing custom code for every file, but that should not be required. It's certainly a deal breaker in a lot of applications.

I'm not trying to sell Matrix Market as the future for everyone. It does have its applications (maybe you're not in the target market for it, that's ok) and benefits that I won't list here because I know you don't care about them (and that's ok). I use it here because it's the only universal format so it serves as a useful comparison target.

alugowski commented 11 months ago

There are many ways to do compression in C++; if you have favorites that would work well with this repo, LMK.