Open · vals opened this issue 6 years ago
Wow, that's a nice improvement.
Yeah, I don't have the exact numbers right now. But reading a large mtx that was stored as text would take me ~15 minutes, while the Parquet-encoded version of the same mtx would take about a minute.
cool benchmark @vals
fwiw, the "scale" HCA group has been discussing formats, in particular parquet, which several of us have experience with in the ADAM project
we mostly dead-ended (for some uses) at the "can't do random access" issue you highlighted; @laserson mentioned that Parquet in principle supports indexing, but none of us have used it that way.
Hey! Regarding the goals of @vals (read/write speed, compression, etc.) without the disadvantage of access limitations: this can be achieved through anndata.h5py, if I'm not mistaken... See these explanations.
Taking your example above:
from anndata import read_mtx, read_h5ad
adata = read_mtx('./data.mtx')
print(adata.X)  # this prints a sparse matrix
adata.write('./data.h5ad')  # this uses `anndata.h5py` to efficiently store sparse data in HDF5
adata = read_h5ad('./data.h5ad')  # this will be a lot faster than the initial read_mtx call...
adata = read_h5ad('./data.h5ad', backed='r')  # this will be super fast as it doesn't load X into memory
It would be really nice to read your feedback... maybe I overlooked something... :wink:
Alex
Hi Alex,
Actually I pair my .mtx.parquet files with .loom files that I interact with using ScanPy for random access. Then, depending on whether I feel like filling my RAM (and have enough) or want to work out of core, I decide which one to use.
Are .h5ad files faster than .loom? I thought the difference between them was just organizational. Reading the sparse COO matrix from parquet is much faster than loading from .loom. I jumped directly to .loom because it was equivalent to .h5ad. (Again, sorry, no real numbers.)
.h5ad should be a lot faster than .loom only for sparse data - in contrast to loom, .h5ad stores sparse data in a sparse data structure at the HDF5 level... similarly to what you do using parquet... only that I wrote a wrapper for h5py that defines an anndata.h5py.SparseDataset and hence gives you very convenient access to all slices using [], just like for h5py.Dataset...
In addition, .loom does not support categorical data or arbitrary unstructured annotation, and it doesn't use structured arrays at the HDF5 level... which again makes subsetting easier in .h5ad...
There are a few more differences... but yeah, both absolutely follow the same idea.
Hi @falexwolf, I just had a reason to revisit some .h5ad files I made a while ago, and found the "backed" mode for the sparse matrix to be pretty smooth!
It took me a while to figure out how to use this functionality though; it would be great to have an example in the documentation. I even tried looking at the examples with the millions of cells, but it seems you just load them all into memory?
Reading your previous reply here I realised I had completely missed that you have a blog post describing how to use it! Oh well.
One thing that's convenient in the loom framework is the scan functionality, which lets you execute several functions row- or column-wise in one go. Generally, convenience functions for iterating over small batches of an h5ad would be useful for various machine learning inference applications.
Hi @vals!
You're right! Until very recently, in Scanpy, we simply worked with AnnData's memory mode. We did this both for speed reasons and for fast code prototyping - even though the backed mode was conceived for AnnData from the very beginning and has been functional since last December.
We're currently working on extending all Scanpy functions to mini-batch (chunk) processing, both in memory and backed mode. You'll notice the speed difference.
Right now, what is there in terms of convenience functions on the master branches - to be released in a couple of days - is:
anndata.AnnData.chunked_X()
scanpy.api.pp.pca(adata, chunked=True)
scanpy.api.pp.log1p(adata, chunked=True)
We'll add more such functions, also for iterating over variables, raw data, etc. Not too far in the future, we'll have a settings.backed_mode=True switch that will allow you to run a pipeline that was previously written for memory mode entirely in backed mode without changing a single line of code...
Any feedback on the whole issue is welcome... we're still in the process of thinking about design principles... For instance, there still seems to be a strange problem here, which we're also currently looking into...
When everything is settled, there will be proper tutorials etc...
PS: Besides the blog post on sparse matrices, there is an even more band-aid-style post on AnnData: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/
We're currently improving the whole AnnData backed infrastructure considerably. @Koncopd now has some capacity to do this properly. After @Koncopd ran extensive benchmarks, it turned out - as was to be expected - that my initial, quickly drafted design is very suboptimal. We'll report progress on this here. We'll also talk to the Linnarsson people in order to further harmonize AnnData with loom. Of course, everything will remain backwards-compatible.
Hi @vals, we have experimented with Parquet for single-cell gene expression data. It enables column-wise loading of data through Pandas, for instance:
pd.read_parquet(filename, columns=your_columns_of_interest)
It has very good compression, fast I/O, and is widely supported. However, it lacks row- and column-specific reading like Zarr. We have an example using the format here, where we just trivially load the entire Parquet file (of downsampled single-cell gene expression data):
I have been speeding up reading/writing sparse matrices by writing the triplet format used in mtx as Parquet files (https://parquet.apache.org/). The main benefit is the increased read speed, but it is also compressed and takes less space. For example, a 31GB mtx file only uses 7GB when saved as parquet.
I save the sparse matrix tables with the first column called cell_idx, the second column called gene_idx, and the third column called count. Then I can read it like this:
Drawbacks are the same as for mtx; in particular, you cannot read random chunks of data.
Thought I would mention it since Parquet was not on the list of investigated formats. Sometimes I also just store flat "dense" matrices in Parquet, but this doesn't scale so well to tens of thousands of cells.