Sage-Bionetworks / sysbioDCCjsonschemas

SysBio DCC JSON schemas

Handling data files in h5ad file format #138

Open calkinsh opened 2 years ago

calkinsh commented 2 years ago

A contributor in PEC would like to provide files in h5ad file format and was wondering how this should be handled on the manifest. This issue can be used to track the conversation for this topic.

calkinsh commented 2 years ago

From Will Poehlman and Laura Heath: "Will: I can't speak too much on it since I've never worked with those files, but my gut would tell me that it would be fine for distributing analysis-ready files. If you want to enable reprocessing of the data at any point, however, you would want to get fastq files onto Synapse. I might suggest pinging Laura Heath, who has a lot more experience with the analysis end of things, to see if they've worked with this format specifically.

Laura: i worked a tiny bit with h5 data a long time ago though i’m not really familiar with it, and it looks like h5ad is pretty much the same (except that it’s a python program output). there are several R packages to upload h5 and h5ad data appropriately, it shouldn’t be a problem and is probably the most efficient way to store files from individual samples.

optimally, we’d want an accompanying metadata file too (for clinical features). and it’d be super optimal if they could also output a combined matrix (all the samples together after qc)--but not knowing anything about PEC, maybe individual files are ok.

Hannah: Will suggested that for reprocessing/analysis we may consider also taking fastq files. Do you think that would be a good approach, or would those be unnecessary?

Laura: i haven’t worked with fastq files for single cell, so i don’t know how complicated it is to process those (h5 files have cell & gene features all in one place, it’s pretty easy to extract everything). but yeah, considering that some people do like to do their own processing from the beginning (including Will, who is creating consistent pipelines between samples from different groups), it might be good to have fastq on hand as well."

calkinsh commented 2 years ago

@pitviper6 here is the issue for this topic

calkinsh commented 2 years ago

As another note, @danlu1 noticed PEC already has at least one h5ad file, located here: https://www.synapse.org/#!Synapse:syn25922812. We can take a look at the annotations on that file to get a better understanding of how the manifest might be filled out. Generally speaking, it seems like h5ad should be fine and fairly straightforward. We can pass on Laura's comments to the contributor to see what he thinks about the additional metadata, combined matrix, and fastq files.

amapeters commented 2 years ago

We already have the fastq files. The process has been as follows: individual teams have provided fastq files and metadata, and the DAC has downloaded and processed the fastq files using a common workflow and wants to contribute the results back in the h5ad format. What we need to make sure of is that the h5ad files link back to the input fastq files through provenance. It was also my understanding from the discussion with Prashant yesterday that they will be providing a matrix of the clinical/demographic metadata.
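
The provenance requirement above could be checked before upload with something like the following. This is a minimal sketch, not the DCC's actual tooling. It assumes a tab-delimited upload manifest with a `path` column and a `used` column, where `used` holds a semicolon-separated list of input entities. The function name `find_unlinked_h5ad`, the file names, and the Synapse IDs are all made-up placeholders.

```python
import csv
import io


def find_unlinked_h5ad(manifest_text, delimiter="\t"):
    """Return the paths of .h5ad rows that list no provenance inputs.

    Assumption: the manifest has `path` and `used` columns, and `used`
    is a semicolon-separated list of the inputs (e.g. fastq entities).
    """
    unlinked = []
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter=delimiter)
    for row in reader:
        path = (row.get("path") or "").strip()
        if not path.lower().endswith(".h5ad"):
            continue  # only h5ad outputs need a link back to fastq inputs
        used = [u.strip() for u in (row.get("used") or "").split(";") if u.strip()]
        if not used:
            unlinked.append(path)
    return unlinked


# Hypothetical manifest: every ID and file name here is a placeholder.
EXAMPLE_MANIFEST = "\n".join([
    "path\tparent\tused",
    "sampleA.h5ad\tsyn0000000\tsyn0000001;syn0000002",
    "sampleB.h5ad\tsyn0000000\t",
    "notes.txt\tsyn0000000\t",
])

if __name__ == "__main__":
    # Lists the h5ad rows that still need provenance filled in.
    print(find_unlinked_h5ad(EXAMPLE_MANIFEST))
```

Note that this only confirms that each h5ad row names at least one input. It does not verify that the listed entities are actually fastq files, which would require looking up each entity on Synapse.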