HumanCellAtlas / table-testing

requirements, examples, and tests for expression matrix file formats
MIT License

Suggestion to investigate - Parquet, which offers fast IO #5

Open vals opened 6 years ago

vals commented 6 years ago

I have been speeding up reading/writing sparse matrices by writing the triplet format used in mtx as Parquet files (https://parquet.apache.org/). The main benefit is the increased read speed, but the files are also compressed and take less space. For example, a 31 GB mtx file uses only 7 GB when saved as Parquet.

I save the sparse matrix tables with the first column called cell_idx, the second column called gene_idx, and the third column called count. Then I can read it like this:

import pandas as pd
from scipy import sparse

data = pd.read_parquet('data.mtx.parquet')
# mtx indices are 1-based, hence the -1; the matrix shape is inferred
# from the largest indices present
count_matrix = sparse.csc_matrix(
    (data['count'].values,
     (data.cell_idx.values - 1, data.gene_idx.values - 1))
)

Drawbacks are the same as for mtx, in particular you cannot read random chunks of data.

Thought I would mention it since Parquet was not on the list of investigated formats. Sometimes I also just store flat "dense" matrices in Parquet, but this doesn't scale so well to tens of thousands of cells.

roryk commented 6 years ago

Wow, that's a nice improvement.

vals commented 6 years ago

Yeah, I don't have the exact numbers right now. But reading a large mtx stored as text would take me ~15 minutes, while the Parquet-encoded version of the same matrix takes about a minute.

ryan-williams commented 6 years ago

cool benchmark @vals

fwiw, the "scale" HCA group has been discussing formats, in particular Parquet, which several of us have experience with from the ADAM project

we mostly dead-ended (for some uses) at the "can't do random access" issue you highlighted; @laserson mentioned that Parquet in principle supports indexing, but none of us have used it that way.

falexwolf commented 6 years ago

Hey! Regarding the goals of @vals (read/write speed, compression, etc.) without the disadvantage of access limitations: this can be achieved through anndata.h5py, if I'm not mistaken... See these explanations

Taking your example above

import pandas as pd
from anndata import read_mtx, read_h5ad

adata = read_mtx('./data.mtx')
print(adata.X)  # this prints a sparse matrix
adata.write('./data.h5ad')  # this uses `anndata.h5py` to efficiently store sparse data in HDF5
adata = read_h5ad('./data.h5ad')  # this will be a lot faster than the initial call...
adata = read_h5ad('./data.h5ad', backed='r')  # this will be super fast as it doesn't load into memory
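Under the hood, the idea is roughly the following sketch, written with plain h5py (this is an illustration of the concept, not anndata's actual on-disk layout): store the CSR arrays as separate HDF5 datasets, then read row slices without ever loading the full matrix.

```python
import os
import tempfile

import h5py
from scipy import sparse

m = sparse.random(100, 50, density=0.1, format='csr', random_state=0)

path = os.path.join(tempfile.mkdtemp(), 'data.h5')
with h5py.File(path, 'w') as f:
    g = f.create_group('X')
    g.create_dataset('data', data=m.data)
    g.create_dataset('indices', data=m.indices)
    g.create_dataset('indptr', data=m.indptr)
    g.attrs['shape'] = m.shape

def read_rows(path, start, stop):
    """Load only rows [start, stop) from disk."""
    with h5py.File(path, 'r') as f:
        g = f['X']
        indptr = g['indptr'][start:stop + 1]
        lo, hi = indptr[0], indptr[-1]
        return sparse.csr_matrix(
            (g['data'][lo:hi], g['indices'][lo:hi], indptr - lo),
            shape=(stop - start, g.attrs['shape'][1]),
        )

rows = read_rows(path, 10, 20)
```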

It would be really nice to read your feedback... maybe I overlooked something... :wink:

Alex

vals commented 6 years ago

Hi Alex,

Actually, I pair my .mtx.parquet files with .loom files that I interact with using Scanpy for random access. Then, depending on whether I feel like filling my RAM (and have enough) or want to work out of core, I decide which one to use.

Are .h5ad files faster than .loom? I thought the difference between them was just organizational. Reading the sparse COO matrix from Parquet is much faster than loading from .loom. I jumped directly to .loom because it was equivalent to .h5ad. (Again, sorry, no real numbers.)

falexwolf commented 6 years ago

.h5ad should be a lot faster than .loom, but only for sparse data - in contrast to loom, .h5ad stores sparse data in a sparse data structure at the HDF5 level... similar to what you do with Parquet... only that I wrote a wrapper for h5py that defines an anndata.h5py.SparseDataset and hence gives you very convenient access to all slices using [], just like for h5py.Dataset...

in addition, .loom does not support categorical data or arbitrary unstructured annotation, and it doesn't use structured arrays at the HDF5 level... which again makes subsetting easier in .h5ad...

there are a few more differences... but yeah, both are absolutely following the same idea

vals commented 6 years ago

Hi @falexwolf I just had a reason to revisit some .h5ad files I made a while ago, and found the "backed" mode for the sparse matrix to be pretty smooth!

It took me a while to figure out how to use these features, though; it would be great to have an example in the documentation. I even tried looking at the examples with the millions of cells, but it seems you just load them all into memory?

Reading your previous reply here I realised I had completely missed that you have a blog post describing how to use it! Oh well.

One thing that's convenient in the loom framework is the scan functionality, which lets you execute several functions row- or column-wise in one pass. Generally, convenience functions for iterating over small batches of an .h5ad would be useful for various machine learning inference applications.
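To illustrate what I mean, the scan idea can be sketched as a small helper over a scipy CSR matrix (scan_rows is a hypothetical function, not a loom or anndata API):

```python
import numpy as np
from scipy import sparse

def scan_rows(matrix, funcs, batch_size=1000):
    """Yield [f(batch) for f in funcs] for each row batch of a sparse matrix."""
    for start in range(0, matrix.shape[0], batch_size):
        batch = matrix[start:start + batch_size]
        yield [f(batch) for f in funcs]

m = sparse.random(5000, 20, density=0.1, format='csr', random_state=0)

# Accumulate per-gene (column) sums and total nonzero counts in one pass.
col_sums = np.zeros(m.shape[1])
nnz_total = 0
for sums, nnz in scan_rows(m, [lambda b: b.sum(axis=0).A1, lambda b: b.nnz]):
    col_sums += sums
    nnz_total += nnz
```

The same pattern works against an HDF5-backed matrix, where batching is what keeps memory bounded.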

falexwolf commented 6 years ago

Hi @vals!

You're right! Until very recently, Scanpy simply worked with AnnData's memory mode. We did this both for speed reasons and for fast code prototyping - even though the backed mode was conceived for AnnData from the very beginning and has been functional since last December.

We're currently working on extending all Scanpy functions to mini-batch (chunk) processing, both in memory and in backed mode. You'll notice the speed difference.

Right now, what is there in terms of convenience functions on the master branches - to be released in a couple of days - is:

We'll add more such functions... also for iterating over variables, raw data, etc. ... In the not-so-far future, we'll have a settings.backed_mode=True switch that will let you run a pipeline previously written for memory mode entirely in backed mode without changing a single line of code...

Any feedback on the whole issue is welcome... we're still in the process of thinking about design principles... For instance, there still seems to be a strange problem here, which we're also currently looking into...

When everything is settled, there will be proper tutorials etc...

PS: Besides the blog post on sparse matrices, there is an even more band-aid-style post on AnnData: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/

falexwolf commented 6 years ago

We're currently improving the whole AnnData backed infrastructure considerably. @Koncopd now has some capacity to do this properly... After @Koncopd ran extensive benchmarks, it turned out - as was to be expected - that my initial, quickly drafted design is very suboptimal... We'll report progress on this here... We'll also talk to the Linnarsson people in order to further harmonize AnnData with loom... Of course, everything will remain backwards-compatible.

cornhundred commented 5 years ago

Hi @vals, we have experimented with Parquet for single-cell gene expression data. It enables column-wise loading of data, through pandas for instance:

pd.read_parquet(filename, columns=your_columns_of_interest)

It has very good compression, fast I/O, and is widely supported. However, it lacks the row- and column-specific chunked reading that Zarr offers. We have an example using the format here, where we just trivially load the entire Parquet file (of downsampled single-cell gene expression data):

https://github.com/ismms-himc/clustergrammer2-notebooks#52-mouse-organogenesis-cell-atlas-2-million-cells