38 / d4-format

The D4 Quantitative Data Format
MIT License

[ENH] Is there utility for single-cell genomics? #27

Open arq5x opened 2 years ago

arq5x commented 2 years ago

ENCODE folks suggested this was a need. What are the current bottlenecks? Speed? Analytical flexibility?

jmschrei commented 1 year ago

I am also interested in this. I'd like to store a single-cell ATAC-seq experiment at basepair resolution. What ended up being the answer?

arq5x commented 1 year ago

Thank you, @jmschrei - we never had clarity about the needs. Currently, D4 supports multiple tracks (e.g., samples) per file. Therefore, it could currently store per-base depths. Would you want distinct depth vectors per cell?

jmschrei commented 1 year ago

Yes, the idea would be to have a single file flexibly hold a (VERY sparse) cell x genome bp resolution count matrix and be able to access it quickly -- potentially during the training of a machine learning model, or otherwise just to store this count matrix in a single file somewhere with an easy way to access it. Right now I'm storing one scipy sparse matrix per chromosome, which seems annoying.
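The per-chromosome workaround described above can be sketched as follows. This is an illustrative reconstruction, not code from the thread: one scipy CSR matrix (cells x basepairs) per chromosome, kept in a dict keyed by chromosome name, with simulated counts standing in for real data.

```python
import numpy as np
from scipy import sparse

# Illustrative sketch of the "one scipy sparse matrix per chromosome"
# workaround: a dict mapping chromosome name -> CSR matrix of shape
# (num_cells, chrom_len). The data here is simulated.
rng = np.random.default_rng(0)
num_cells, chrom_len = 5, 1000

# Very sparse per-base counts (Poisson with a small lambda).
dense = rng.poisson(0.005, size=(num_cells, chrom_len))
per_chrom = {"chr1": sparse.csr_matrix(dense)}

# Region query: slice columns for chr1:100-200 across all cells.
region = per_chrom["chr1"][:, 100:200]
print(region.shape)  # (5, 100)
```

The annoyance jmschrei mentions is visible here: every region query and every save/load has to dispatch on chromosome name, with one file per chromosome on disk.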

Can you point me to the documentation of multi-track d4 creation? Most of the documentation on the GitHub README seems to address converting a single track from an existing format (e.g. bigwig) into the d4 format. What do you think the best way to go from fragment file to a d4 object would be?

arq5x commented 1 year ago

Can you clarify what a fragment file is? There are some examples of writing D4 files with the Python API here: https://github.com/38/d4-format/blob/master/pyd4/examples/D4%20Tutorial%20in%20Python.ipynb

@38 and @brentp do you have any thoughts about how to take potentially thousands (1 per cell) of very sparse genome-wide depth vectors and programmatically store them as tracks in a multi-track D4 file?

arq5x commented 1 year ago

I don't think the API currently meets your exact need but perhaps it could be enhanced to do so.

jmschrei commented 1 year ago

A fragment file is a bed-like file that contains fragments mapped to a reference genome. In addition to the start and end of the reads, it also contains the cell ID that the fragment was mapped to. An entire single-cell experiment can be contained within a fragment file but they are not particularly efficient for slicing and aggregation.
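The description above can be made concrete with a small sketch. This assumes a minimal four-column bed-like layout (chrom, start, end, cell barcode); real fragment files (e.g. from 10x Genomics pipelines) are TSV and may carry extra columns such as a duplicate count, so the parsing is illustrative.

```python
import numpy as np

# Minimal sketch: turn bed-like fragment records (chrom, start, end,
# cell barcode) into one per-base depth vector per cell. Assumes a
# single chromosome of known length and exactly four fields per record.
chrom_len = 100
fragments = [
    ("chr1", 10, 20, "AAAC"),
    ("chr1", 15, 30, "AAAC"),
    ("chr1", 40, 50, "GGGT"),
]

depth = {}  # barcode -> per-base depth vector
for chrom, start, end, barcode in fragments:
    vec = depth.setdefault(barcode, np.zeros(chrom_len, dtype=np.int32))
    vec[start:end] += 1  # half-open interval, bed convention

print(depth["AAAC"][15])  # 2: both AAAC fragments overlap base 15
print(depth["GGGT"][45])  # 1
```

Each resulting vector is exactly the kind of per-cell depth track the thread discusses writing into a multi-track D4 file.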

And, to clarify, it's not so much "potentially thousands" as "definitely tens of thousands, and up to a million." My data set from several years ago has 68k cells.

What part of the API is missing? I'm happy to load all the data into memory for usage as long as it can be stored efficiently on disk in a single file.

arq5x commented 1 year ago

My understanding of the Python API is that you would have to create a distinct D4 file for each cell using the D4Builder class, then use the D4Merger class to combine them all. @38 is there a way of creating a D4 file from N numpy arrays of the same length M? Perhaps by using D4Builder to create the stub D4 file, then giving an add_tagged_tracks() method lists of numpy arrays and track names from which to populate the D4 file?

arq5x commented 1 year ago

The following program illustrates how this can be (mostly) done with the existing API. Basically, it simulates sparse data for 10 cells, writes the data to a distinct d4 file for each, merges the 10 d4 files into one (arbitrarily chosen), and then shows how to access the data. One weakness is that a track name is not created for the first cell. Another major weakness is that the d4 files need to be landed to disk. Ideally, we could write tracks to a D4 file directly from a numpy array. @38 are fixes / improvements possible or existing, yet undocumented?

from pyd4 import D4File, D4Builder, D4Merger
import numpy

chrom_len = 10000000
chroms = [("1", chrom_len)]

# simulate 10 sparse d4 files
num_cells = 10
for i in range(num_cells):
    d4file = D4Builder("cell_" + str(i) + ".d4") \
        .add_chroms(chroms) \
        .for_sparse_data() \
        .get_writer()
    # simulate sparse data with poisson, lambda = 0.01
    cell_data = numpy.random.poisson(0.01, chrom_len)
    # write it to the temp d4 file and close
    d4file.write_np_array("1", 0, cell_data)
    d4file.close()

# merge all d4 files
# arbitrarily choose cell_0 file to append to
combined_d4 = D4Merger("cell_0.d4")
for i in range(1, num_cells):
    label = "cell_" + str(i)
    file_to_merge = "cell_" + str(i) + ".d4"
    # add cell_i data to the d4 file
    combined_d4.add_tagged_track(label, file_to_merge)
# finalize the merged file.
combined_d4.merge()

# do stuff with the merged file
merged_file = D4File("cell_0.d4")
print("Tracks in file", merged_file.list_tracks())

matrix = merged_file.open_all_tracks()
# To enumerate the matrix 
for values in matrix.enumerate_values("1", 0, chrom_len):
    print(values)
# You can also downsample the matrix 
matrix.resample("1", bin_size=10000)

arq5x commented 1 year ago

Looking at the first few lines of the resulting file (1.9 MB; columns are chrom, start, end, then one depth value per track):

d4tools view cell_0.d4 | head
1   0   7   0   0   0   0   0   0   0   0   0
1   7   8   0   0   0   1   0   0   0   0   0
1   8   11  0   0   0   0   0   0   0   0   0
1   11  12  0   0   0   1   0   0   0   0   0
1   12  24  0   0   0   0   0   0   0   0   0
1   24  25  0   0   1   0   0   0   0   0   0
1   25  30  0   0   0   0   0   0   0   0   0
1   30  31  1   0   0   0   0   0   0   0   0
1   31  55  0   0   0   0   0   0   0   0   0
1   55  56  0   0   0   0   0   1   0   0   0

arq5x commented 1 year ago

Random access into the matrix

for i in range(100):
    rand_start = numpy.random.randint(chrom_len)
    for values in matrix.enumerate_values("1", rand_start, rand_start+1):
    print(values)

jmschrei commented 1 year ago

Ah, interesting, thank you. So you'd be writing one cell and chromosome at a time into it? How long do you think that would take to do? And is there a way to get a sparse matrix out?

arq5x commented 1 year ago

Yes, but you could concatenate chromosomes and do the modular arithmetic to know what chrom you are on if you wanted. As for a sparse matrix, I am not sure. Natively it is not, but the API will yield numpy arrays, so you may be able to cobble something together. @38 may know of better options.
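One plausible reading of the concatenation idea, sketched with illustrative chromosome names and lengths: lay the chromosomes end-to-end on a single coordinate axis, record each chromosome's cumulative start offset, and use a binary search (rather than strictly modular arithmetic, since chromosomes differ in length) to map a global offset back to a (chromosome, position) pair.

```python
import bisect

# Lay chromosomes end-to-end on one coordinate axis; lengths are
# illustrative, not real genome sizes.
chroms = [("chr1", 1000), ("chr2", 800), ("chr3", 500)]

# Cumulative start offset of each chromosome in the concatenated axis:
# chr1 starts at 0, chr2 at 1000, chr3 at 1800.
starts, total = [], 0
for name, length in chroms:
    starts.append(total)
    total += length

def to_local(global_pos):
    """Map a concatenated coordinate back to (chrom, local_pos)."""
    i = bisect.bisect_right(starts, global_pos) - 1
    return chroms[i][0], global_pos - starts[i]

print(to_local(1500))  # ('chr2', 500)
print(to_local(1800))  # ('chr3', 0)
```

With this mapping, a whole genome's worth of per-cell depths could live in a single concatenated vector per cell.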

jmschrei commented 1 year ago

> Yes, but you could concatenate chromosomes and do the modular arithmetic to know what chrom you are on if you wanted.

Sorry, can you elaborate a little here?

38 commented 1 year ago

> Yes, but you could concatenate chromosomes and do the modular arithmetic to know what chrom you are on if you wanted. As for sparse matrix, I am not sure. Natively, it is not, but the API will yield numpy arrays so you may be able to cobble something. @38 may know of better options

For loading data, pyd4 currently supports loading data from all tracks within a given region directly into a 2D numpy array.

matrix["chr1:1-1000000"]  # loads the data in region chr1:1-1000000 across all tracks and returns it as a 2D numpy array

For creating a multi-track D4 file, the current API is a little annoying: you need to create the single-track D4 files first and then merge them all. But this can be improved by merging into the D4 file gradually, and I think we can make that change (with it, the amount of disk space needed to create a multi-track D4 file should also be reduced).
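Since the region accessor above returns a dense 2D numpy array, getting the sparse matrix jmschrei asked about is a one-line conversion on the caller's side. The array below is simulated stand-in data, with an assumed (tracks x positions) shape, so the sketch runs without pyd4.

```python
import numpy as np
from scipy import sparse

# Stand-in for the dense 2D slice a region query would return:
# 10 tracks (cells) by 1,000,000 positions, almost entirely zero.
dense_slice = np.zeros((10, 1_000_000), dtype=np.int32)
dense_slice[3, 12345] = 1
dense_slice[7, 99999] = 2

# Convert the dense region to a compressed sparse row matrix.
sparse_slice = sparse.csr_matrix(dense_slice)
print(sparse_slice.nnz)  # 2 non-zero entries stored
```

The transient dense allocation is the cost of this route; for very sparse data, querying modest region sizes keeps that allocation small.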

arq5x commented 1 year ago

Thanks for the reminder about stepwise merging, @38! The problem I am facing now is that with 1000 tracks I run into an OS error from having too many files open. I wonder if there is a way to avoid that by opening the files in batches to populate the matrix?
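A batched merge could stay under the OS open-file limit (often 1024; see `ulimit -n`). Only the batching logic below is concrete; the commented pyd4 calls assume D4Merger can reopen a partially merged file as its base, which is exactly the stepwise-merge support under discussion, so they are hypothetical.

```python
# Split per-cell D4 filenames into batches small enough that each
# merge pass opens well under the fd limit.
def batch_files(files, batch_size=100):
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

files = [f"cell_{i}.d4" for i in range(1000)]
batches = batch_files(files)
print(len(batches))    # 10
print(batches[0][:2])  # ['cell_0.d4', 'cell_1.d4']

# Hypothetical merge loop, assuming a reopenable base file:
# for batch in batches:
#     merger = D4Merger("combined.d4")
#     for path in batch:
#         merger.add_tagged_track(path[:-3], path)  # tag = name sans ".d4"
#     merger.merge()
```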

38 commented 1 year ago

> Thanks for reminding of stepwise merging, @38! The problem I am facing now is that if I have 1000 tracks, I run into an OS error with too many files open. I wonder if there is a way to avoid that by batched opens to populate the matrix?

Yes, I recently fixed a bug where d4tools could not find all the tracks defined in a multi-track d4 file. I think that may have been the only blocker for supporting stepwise merging.

arq5x commented 1 year ago

Nice! Can you push the change so we can test it?

38 commented 1 year ago

Sure, I will let you know when the change is ready to test.