HumanCellAtlas / table-testing

requirements, examples, and tests for expression matrix file formats

Add zarr-chunking along first axis, based on (uncompressed) size #9

Open · ryan-williams opened this issue 6 years ago

ryan-williams commented 6 years ago

As discussed on the matrix data formats call just now, I wrote some logic for chunking every dataset in a zarr "file" along only its first dimension, such that each chunk's uncompressed size is a given amount (e.g. 64MB).

This also required a minor change to zarr that lives in lasersonlab/zarr for now but will hopefully be upstreamed later.

This chunking scheme arguably makes more sense for RNA data than a 2D-chunking scheme: rows are individual cells and include all genes for that cell, so a chunk is a fixed number of cells/rows (though for e.g. image data, 2D-chunking is likely better). We should therefore include a few such "row-chunked" zarr formats here (perhaps at 32MB, 64MB, and 128MB chunk sizes) to see how they fare against other formats.

We/I should also put that logic somewhere more broadly accessible.
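For reference, here's roughly what that looks like (a minimal sketch, not the actual lasersonlab/zarr-based code; it assumes zarr v2's h5py-style group API, arrays small enough to read into memory, and made-up paths/function names):

```python
import numpy as np
import zarr

def row_chunks(arr, target_bytes):
    """Chunk shape spanning all trailing dims and enough rows to total ~target_bytes uncompressed."""
    row_bytes = arr.dtype.itemsize * int(np.prod(arr.shape[1:], dtype=np.int64))
    rows = max(1, int(target_bytes // row_bytes))
    return (min(rows, arr.shape[0]),) + tuple(arr.shape[1:])

def rechunk_first_axis(src_path, dst_path, target_bytes=64 * 1024 ** 2):
    """Copy a zarr group, re-chunking every array along only its first axis."""
    src = zarr.open_group(src_path, mode='r')
    dst = zarr.open_group(dst_path, mode='w')

    def copy(name, item):
        if isinstance(item, zarr.Array):
            dst.create_dataset(
                name,                                   # nested paths like 'X/data' create parent groups
                data=item[:],                           # reads the whole array; fine for a sketch
                chunks=row_chunks(item, target_bytes),  # e.g. ~64MB of rows per chunk, uncompressed
            )
            # (attrs and compressor settings aren't copied here, for brevity)

    src.visititems(copy)

rechunk_first_axis('in.zarr', 'out.64m.zarr', target_bytes=64 * 1024 ** 2)
```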

ryan-williams commented 6 years ago

A first step here could be for me to upload the row-chunked zarrs I've created from a few reference datasets to a public bucket, so that we can eyeball the tradeoffs around number of chunk files, chunk-file sizes, etc.

samanvp commented 6 years ago

I think we need to consider the following points here (quoted in the reply below):

ryan-williams commented 6 years ago

Interesting points!

Data type: IMO for RNA-Seq data uint16 would be enough, so 2 bytes per expression value.

I don't totally follow this; my WIP size-based-chunking code uses the data type's size to compute the uncompressed chunk size, and from eyeballing some 10x/anndata files, that type is typically a 4-byte float or int.
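For example, back-of-the-envelope (the gene count here is illustrative, not taken from a specific dataset):

```python
import numpy as np

target = 64 * 1024 ** 2                    # 64MB uncompressed per chunk
genes = 27_000                             # illustrative genes-per-cell, dense matrix
itemsize = np.dtype('float32').itemsize    # 4 bytes; uint16 would halve this to 2

print(target // (genes * itemsize))        # ~621 cells/rows per chunk at float32, ~1242 at uint16
```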

Compression: Many formats (including zarr) compress chunks so when we aim for 32MB chunk size, their actual size would be smaller (and I suspect the compression rate would depend on the sparsity of the matrix).

Yea, using the uncompressed size is somewhat necessary (because it's hard/infeasible to chunk based on compressed size), and somewhat desirable, since "we" want the number of processed records to be similar across distributed workers.

As an aside, myself and others working with Spark for genomics data have run into trouble with fixed-size chunks of compressed data (e.g. when a BAM file is broken into 32MB chunks): the actual number of records in each chunk can vary wildly if, say, one chunk has a lot of duplicate reads, which compress pathologically well; I've seen 100x swings in the effective/uncompressed size of chunks due to this.

Optimal chunk size: Perhaps this would be different for different storage backends. For example, according to this article

Good point; we should try to make sure that our uncompressed-size-based chunks still yield compressed chunks that are >1MB.
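Something like this (a hypothetical helper that just walks a local DirectoryStore) could flag compressed chunk files that come out under 1MB:

```python
import os

def small_chunks(zarr_dir, min_bytes=1024 ** 2):
    """Yield (path, size) for compressed chunk files smaller than min_bytes."""
    for root, _, files in os.walk(zarr_dir):
        for name in files:
            if name.startswith('.z'):          # skip .zarray / .zgroup / .zattrs metadata
                continue
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size < min_bytes:
                yield path, size

for path, size in small_chunks('ica_cord_blood.ad.128m.zarr'):
    print(f'{size / 1024:.0f}KB  {path}')
```

(Trailing chunks and small 1-D arrays like indptr/obs/var will always show up here; the interesting cases are the X/data and X/indices chunks.)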

As a point of reference, here's an HCA 10x dataset that I converted to anndata HDF5 and then to anndata zarr with 128MB (uncompressed) chunks:

$ tree -sh -L 3 ica_cord_blood.ad.128m.zarr/
ica_cord_blood.ad.128m.zarr/
├── [ 192]  X
│   ├── [ 352]  data
│   │   ├── [ 37M]  0
│   │   ├── [ 37M]  1
│   │   ├── [ 37M]  2
│   │   ├── [ 38M]  3
│   │   ├── [ 38M]  4
│   │   ├── [ 38M]  5
│   │   ├── [ 35M]  6
│   │   └── [ 28M]  7
│   ├── [ 352]  indices
│   │   ├── [ 47M]  0
│   │   ├── [ 46M]  1
│   │   ├── [ 46M]  2
│   │   ├── [ 47M]  3
│   │   ├── [ 47M]  4
│   │   ├── [ 47M]  5
│   │   ├── [ 46M]  6
│   │   └── [ 37M]  7
│   └── [ 128]  indptr
│       └── [1.3M]  0
├── [ 128]  obs
│   └── [2.8M]  0
└── [ 128]  var
    └── [1.0M]  0

likewise with 32MB (uncompressed) chunks:

ica_cord_blood.ad.32m.zarr/
├── [ 192]  X
│   ├── [1.1K]  data
│   │   ├── [9.4M]  0
│   │   ├── [9.4M]  1
│   │   ├── [9.3M]  10
│   │   ├── [9.4M]  11
│   │   ├── [9.5M]  12
│   │   ├── [9.4M]  13
│   │   ├── [9.5M]  14
│   │   ├── [9.5M]  15
│   │   ├── [9.4M]  16
│   │   ├── [9.4M]  17
│   │   ├── [9.4M]  18
│   │   ├── [9.5M]  19
│   │   ├── [9.4M]  2
│   │   ├── [9.5M]  20
│   │   ├── [9.6M]  21
│   │   ├── [9.6M]  22
│   │   ├── [8.9M]  23
│   │   ├── [8.9M]  24
│   │   ├── [8.8M]  25
│   │   ├── [8.7M]  26
│   │   ├── [8.7M]  27
│   │   ├── [9.0M]  28
│   │   ├── [9.1M]  29
│   │   ├── [9.3M]  3
│   │   ├── [9.4M]  30
│   │   ├── [622K]  31
│   │   ├── [9.3M]  4
│   │   ├── [9.2M]  5
│   │   ├── [9.3M]  6
│   │   ├── [9.3M]  7
│   │   ├── [9.3M]  8
│   │   └── [9.3M]  9
│   ├── [1.1K]  indices
│   │   ├── [ 12M]  0
│   │   ├── [ 12M]  1
│   │   ├── [ 11M]  10
│   │   ├── [ 12M]  11
│   │   ├── [ 12M]  12
│   │   ├── [ 12M]  13
│   │   ├── [ 12M]  14
│   │   ├── [ 12M]  15
│   │   ├── [ 12M]  16
│   │   ├── [ 12M]  17
│   │   ├── [ 12M]  18
│   │   ├── [ 12M]  19
│   │   ├── [ 12M]  2
│   │   ├── [ 12M]  20
│   │   ├── [ 12M]  21
│   │   ├── [ 12M]  22
│   │   ├── [ 12M]  23
│   │   ├── [ 12M]  24
│   │   ├── [ 12M]  25
│   │   ├── [ 12M]  26
│   │   ├── [ 12M]  27
│   │   ├── [ 12M]  28
│   │   ├── [ 12M]  29
│   │   ├── [ 12M]  3
│   │   ├── [ 12M]  30
│   │   ├── [753K]  31
│   │   ├── [ 12M]  4
│   │   ├── [ 12M]  5
│   │   ├── [ 12M]  6
│   │   ├── [ 12M]  7
│   │   ├── [ 12M]  8
│   │   └── [ 12M]  9
│   └── [ 128]  indptr
│       └── [889K]  0
├── [ 128]  obs
│   └── [2.4M]  0
└── [ 128]  var
    └── [647K]  0

So it seems like we're getting a compression ratio of roughly 2-3x on these (similar to the common range for BAMs, incidentally), so it should be straightforward to size the chunks for reasonable compressed and uncompressed sizes.
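For what it's worth, the per-array ratios can be read straight out of zarr (a quick sketch, assuming the stores above are on local disk and using zarr v2's nbytes / nbytes_stored array properties):

```python
import zarr

def compression_ratios(path):
    """Print uncompressed vs. stored size, and the ratio, for every array in a zarr group."""
    group = zarr.open_group(path, mode='r')

    def report(name, item):
        if isinstance(item, zarr.Array) and item.nbytes_stored > 0:
            mb = 1024 ** 2
            print(f'{name}: {item.nbytes / mb:.1f}MB uncompressed -> '
                  f'{item.nbytes_stored / mb:.1f}MB stored '
                  f'({item.nbytes / item.nbytes_stored:.1f}x)')

    group.visititems(report)

compression_ratios('ica_cord_blood.ad.128m.zarr')
```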