ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Modeling Single Cell RNA Quantifications #758

Open david4096 opened 7 years ago

david4096 commented 7 years ago

This issue is meant to spark discussion regarding what will be needed, if anything, to properly model single cell RNA samples: what ontologies will need to be used, and whether new indexed fields will be required in order to support basic RNA interchange use cases. @mbaudis @saupchurch

It seems to me that the large number of samples requires that we provide a way to filter biosamples by cell type. Although there is other phenotype data about scRNA-seq that we may want, I hope to limit the discussion to the minimum changes needed to interchange the data. I believe that once we add attributes to the RNA quantification data, we will be able to interchange arbitrary key-value pairs that include ontology terms.
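To make the filtering idea concrete, here is a minimal sketch of selecting biosamples by a cell-type ontology term carried in key-value attributes. The record layout, the `cell_type` key, and the `filter_by_cell_type` helper are all hypothetical and are not taken from the GA4GH schemas; the Cell Ontology terms are only illustrative.

```python
# Hypothetical biosample records. The "attributes" layout and the
# "cell_type" key are assumptions for illustration, not the GA4GH schema.
BIOSAMPLES = [
    {"id": "cell-001",
     "attributes": {"cell_type": {"term_id": "CL:0000540", "term": "neuron"}}},
    {"id": "cell-002",
     "attributes": {"cell_type": {"term_id": "CL:0000084", "term": "T cell"}}},
    {"id": "cell-003",
     "attributes": {"cell_type": {"term_id": "CL:0000540", "term": "neuron"}}},
]


def filter_by_cell_type(biosamples, term_id):
    """Return the biosamples whose cell_type attribute has the given term ID."""
    matches = []
    for sample in biosamples:
        cell_type = sample.get("attributes", {}).get("cell_type", {})
        if cell_type.get("term_id") == term_id:
            matches.append(sample)
    return matches


neurons = filter_by_cell_type(BIOSAMPLES, "CL:0000540")
print([sample["id"] for sample in neurons])
```

Matching on the term ID rather than the label keeps the filter stable if two preparers spell the cell type differently.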

External Identifiers

Searching for expression levels by external identifier, rather than by a specific featureId, would also be useful. This would allow data preparers to interchange the data available to them without needing to construct a server with complete referential integrity. For example, if the data for an scRNA-seq experiment gives Ensembl IDs, one can request: {external_id: 'ENST0000', db: 'ensembl'}. https://github.com/ga4gh/schemas/issues/633
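A rough sketch of how such a request might be resolved against expression levels that carry external identifiers instead of server-internal feature IDs. The `external_ids` field, the record layout, and the `search_by_external_id` function are assumptions for illustration; 'ENST0000' is the placeholder from the example above and 'ENST0001' is a made-up counterpart, neither is a real Ensembl ID.

```python
# Hypothetical expression-level records; values and field names are made up.
EXPRESSION_LEVELS = [
    {"quantification_id": "cell-001", "expression": 12.5,
     "external_ids": [{"external_id": "ENST0000", "db": "ensembl"}]},
    {"quantification_id": "cell-001", "expression": 0.0,
     "external_ids": [{"external_id": "ENST0001", "db": "ensembl"}]},
    {"quantification_id": "cell-002", "expression": 3.2,
     "external_ids": [{"external_id": "ENST0000", "db": "ensembl"}]},
]


def search_by_external_id(levels, query):
    """Return expression levels annotated with the requested (external_id, db) pair."""
    wanted = (query["external_id"], query["db"])
    return [
        level for level in levels
        if any((x["external_id"], x["db"]) == wanted for x in level["external_ids"])
    ]


hits = search_by_external_id(
    EXPRESSION_LEVELS, {"external_id": "ENST0000", "db": "ensembl"}
)
print([(hit["quantification_id"], hit["expression"]) for hit in hits])
```

Because the match is on the identifier and database name, the server never has to resolve the identifier to a feature it stores itself.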

Gene symbols

If only gene symbols are provided, we should offer applications a way to filter expression levels by that field, similar to sequence annotations. The goal is to be able to interchange the old "table of sample names against gene symbols" in a useful way.
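As a sketch of what that could look like, the classic sample-by-gene-symbol table can be flattened into one record per (sample, gene symbol) pair and then filtered by symbol. The field names, helper functions, and expression values below are hypothetical and do not come from the GA4GH schemas.

```python
# A toy "table of sample names against gene symbols"; values are made up.
TABLE = {
    "cell-001": {"GAPDH": 150.0, "ACTB": 98.5},
    "cell-002": {"GAPDH": 143.2, "ACTB": 0.0},
}


def table_to_expression_levels(table):
    """Flatten a {sample: {gene_symbol: value}} table into per-pair records."""
    return [
        {"sample_name": sample, "gene_symbol": symbol, "expression": value}
        for sample, row in table.items()
        for symbol, value in row.items()
    ]


def filter_by_gene_symbol(levels, symbol):
    """Return only the expression levels recorded for the given gene symbol."""
    return [level for level in levels if level["gene_symbol"] == symbol]


levels = table_to_expression_levels(TABLE)
for level in filter_by_gene_symbol(levels, "ACTB"):
    print(level["sample_name"], level["expression"])
```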

The SCESet class in scater (https://github.com/davismcc/scater) provides a good basis for the minimum interchange required to construct a useful analysis.