HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Adding field to delineate between single cell, bulk, etc. #76

malloryfreeberg closed 6 years ago

malloryfreeberg commented 6 years ago

A user suggests clarifying whether a set of samples comes from single cell or bulk RNA sequencing, since we plan to have non-single-cell data in the future. This will likely be a field on the RNA schema.

JimKent commented 6 years ago

The absence/presence of assay.single_cell indicates this.

malloryfreeberg commented 6 years ago

Perhaps it would make sense, for indexing purposes, to have single cell vs. bulk be an explicit field. Would it be easier to search for "all single cell experiments with parameters X, Y, Z" if the "single cell" part was a concrete field? Otherwise, the query would have to check whether an assay bundle contained single_cell.json or not, which seems like a potentially dangerous thing to rely on.

lauraclarke commented 6 years ago

I agree a flag field could be useful for downstream users rather than the presence/absence of a collection of other fields.

We should make it a sequencing field and not an RNA field, though, as we are going to get bulk and single cell ATAC, and I would be very surprised if it doesn't happen for other assays too over time.

JimKent commented 6 years ago

I'll ask the indexing people here. I think the query on a field (of any type) existing is pretty easy.

JimKent commented 6 years ago

The check for a field existing is quite easy in Elastic. See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html
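
As a rough illustration of the exists check mentioned above, here is a minimal Python sketch that builds the request bodies for an Elasticsearch exists query. The field name assay.single_cell comes from the earlier comment; the helper names are made up for this example, and treating "field absent" as "bulk" is an assumption, not something the schema guarantees.

```python
import json

# Field name taken from the earlier comment in this thread.
FIELD = "assay.single_cell"


def single_cell_query(field=FIELD):
    """Request body matching documents where `field` has an indexed value."""
    return {"query": {"exists": {"field": field}}}


def bulk_query(field=FIELD):
    """Request body matching documents where `field` is absent.

    Assumption: absence of the field means the assay is bulk.
    """
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


if __name__ == "__main__":
    # Print the bodies that would be sent to the search endpoint.
    print(json.dumps(single_cell_query(), indent=2))
    print(json.dumps(bulk_query(), indent=2))
```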

lauraclarke commented 6 years ago

My concern isn't really an implementation one (most query services allow checking whether multiple fields exist). It's more about the sustainability of the code/queries.

At the moment the combination of fields which need to be queried to tell the difference between bulk and single cell is quite small but as more assay types are added, and new technologies are developed, the range and combination of fields will expand and the nature of this query will get more complex.

A single field lets us distinguish the two more easily, and the query doesn't need to change when a new assay type (say, single cell epigenomics beyond ATAC) gets added.
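
To make the maintenance concern concrete, here is a small Python sketch contrasting the two query shapes. Every field name passed in is hypothetical (only assay.single_cell appears in this thread), and the term query assumes the flag would be indexed as an exact-match keyword.

```python
def presence_based_query(marker_fields):
    """Single cell if ANY marker field exists.

    The list of marker fields has to be extended whenever a new
    single cell assay type introduces its own block of fields.
    """
    return {
        "query": {
            "bool": {
                "should": [{"exists": {"field": f}} for f in marker_fields],
                "minimum_should_match": 1,
            }
        }
    }


def flag_based_query(flag_field, value):
    """Single exact-match lookup on a dedicated flag field.

    The query shape stays the same no matter how many assay types exist.
    """
    return {"query": {"term": {flag_field: value}}}


if __name__ == "__main__":
    # Hypothetical marker fields: the second one stands in for a future assay.
    today = presence_based_query(["assay.single_cell"])
    later = presence_based_query(["assay.single_cell", "assay.hypothetical_sc_atac"])
    print(len(today["query"]["bool"]["should"]), "clause(s) today")
    print(len(later["query"]["bool"]["should"]), "clause(s) after one new assay type")
    # Hypothetical flag field name and value.
    print(flag_based_query("hypothetical_cell_resolution", "single cell"))
```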

daniwelter commented 6 years ago

In v5, we have added a nucleic_acid_source enum with values including bulk cell, single cell and others. Is that sufficient to address this use case or is more work needed here?

malloryfreeberg commented 6 years ago

input_nucleic_acid_molecule now describes the precise type of molecule that is sequenced, and nucleic_acid_source describes whether single cells, single nuclei, bulk cells, bulk nuclei, etc. were the source of the molecule. Currently the former is an ontology and the latter is an enum. Closing this issue as it is solved.
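
For downstream users, a check against the nucleic_acid_source enum might look roughly like the Python sketch below. The enum members are assumptions based on the wording in this thread ("bulk cell", "single cell", plus nuclei variants); the v5 schema itself is the authority on the exact strings, and the helper name is made up for this example.

```python
from enum import Enum


class NucleicAcidSource(Enum):
    # Assumed values; check the v5 schema for the exact strings.
    BULK_CELL = "bulk cell"
    SINGLE_CELL = "single cell"
    BULK_NUCLEI = "bulk nuclei"
    SINGLE_NUCLEUS = "single nucleus"


# Sources treated as "single" rather than "bulk" in this sketch.
_SINGLE = {NucleicAcidSource.SINGLE_CELL, NucleicAcidSource.SINGLE_NUCLEUS}


def is_single_entity(source: str) -> bool:
    """Return True for single cell / single nucleus sources.

    Raises ValueError if `source` is not one of the assumed enum values,
    so an unexpected value fails loudly instead of being counted as bulk.
    """
    return NucleicAcidSource(source) in _SINGLE


if __name__ == "__main__":
    for value in ("single cell", "bulk cell"):
        print(value, is_single_entity(value))
```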