HumanCellAtlas / table-testing

requirements, examples, and tests for expression matrix file formats
MIT License
22 stars 3 forks source link

table-testing

This repository provides a collaborative space to specify requirements, examples, and tests for on-disk file formats used to store expression matrices arising from single-cell RNA sequencing analyses. There are currently a wide variety of formats used for these data, including generic formats (e.g. CSV) and those designed specifically for this domain (e.g. Loom).

Please open an issue or make a PR if you'd like to add ideas or start a discussion!

Associated documents that inform this repo can be found in this Google Drive and are linked throughout the repository.

We meet from time to time, here are our meeting notes, feel free to reach out and join!

Goals

Requirements

In many discussions, requirements fall broadly into two categories, archival (long-term storage) and analysis (daily use with analytical software in e.g. R or Python). As written, some of these are explicit requirements (e.g. self-describing), whereas others are dimensions along which different formats vary (e.g. size and speed).

Archival

Analysis

Formats

Here we list formats that have been used or proposed thus far in the community (please add!):

Standard Matrices

We benchmark data formats and APIs on standard matrices. These matrices are found in public S3 and GCP buckets: s3://matrix-format-test-data and gs://matrix-format-test-data.

That bucket has two relevant folders, matrices and sources. The sources folder contains raw data from a number of different data sources, in the form that they were originally delivered. These are meant to span a range of matrix sizes and properties so we can develop tests for diverse use cases:

Source Name Description
tenx_mouse_neuron_1M The 1.3 million mouse neuron dataset released by 10x.
tenx_mouse_neuron_20k A sample of 20k cells from the 1.3 million cell dataset, also provided by 10x.
immune_cell_census_cord_blood Cord blood cells from the Immune Cell Atlas dataset released for the HCA preview. 384,000 cells sequenced with 10x.
immune_cell_census_bone_marrow Bone marrow cells from the Immune Cell Atlas dataset released for the HCA preview. 378,000 cells sequenced with 10x.
GSE84465_whole 3589 cells sequenced with SmartSeq2 from GEO Series GSE84465.
GSE84465_split 3589 cells sequenced with SmartSeq2 from GEO Series GSE84465, split so there is one matrix file per cell.

The matrices folder contains two levels of subfolders. The first is the matrix format. These are the formats that are being tested for performance among a set of different tasks.

Format Name Description
anndata The HDF5-based format used by scanpy
feather Feather format
hdf5_10000_10000_chunks_3_compression Dense matrix in an HDF5 dataset with (10000, 10000) chunks and gzip=3 compression
hdf5_1000_1000_chunks_3_compression Dense matrix in an HDF5 dataset with (1000, 1000) chunks and gzip=3 compression
hdf5_25000_25000_chunks_3_compression Dense matrix in an HDF5 dataset with (25000, 25000) chunks and gzip=3 compression
matrix_market Matrix Market format
numpy NumPy .npy format
loom Loom format
parquet Parquet format
sparse_hdf5_csc Compressed Sparse Column matrix in HDF5, similar to what is produced by Cell Ranger
sparse_hdf5_csr Compressed Sparse Row matrix in HDF5
zarr_10000_10000 Zarr Directory Store with (10000, 10000) chunks
zarr_1000_1000 Zarr Directory Store with (1000, 1000) chunks
zarr_25000_25000 Zarr Directory Store with (25000, 25000) chunks

The next level of subfolder is the sources, which are detailed above. So for example, s3://matrix-format-test-data/matrices/anndata/immune_cell_census_bone_marrow/anndata_immune_cell_census_bone_marrow.h5ad is the Immune Cell Atlas bone marrow data converted into the scanpy anndata format.

The S3 paths are stored in a more machine readable format in data.yaml.

Test suite

The tests are split into "tasks", where a task is some operation on an expression matrix. Within each task, there are one or more "tests", which accomplish the task using one of the matrix formats.

Test structure

Each test has a "test.yaml" file that indicates how the test should be run. The test.yaml has the following keys:

The test also needs to specify a docker container with an entrypoint that accepts the following command line arguments:

The details of how the inputs are converted to the outputs are within the test's docker container and any scripts or other code that gets added to it. The benchmarking code will be responsible for staging inputs, running the container, and measuring performance.

Feel free to contribute!