HumanCellAtlas / ontology


Ontology term for Cell Barcode files #97

Closed ipediez closed 2 years ago

ipediez commented 2 years ago

Hi!

At the HCA wrangler meetings, we've been discussing the best ontology choice for the different file types that we work with. One of them is particularly difficult and might need a new ontology term: the Cell Barcodes files coming from 10X sequencing.

Those files contain a list of the cell barcodes in the experiment, which are unique identifiers for each of the cells analyzed in that experiment. They are combined with a file containing a list of genes (which we tag as a feature table, data:1270) and a file containing the expression values and the links to the gene and barcode files (which we tag as a count matrix, data:3917). Until now, we've been tagging cell barcodes files as Molecular property identifier (data:2110), but we've realized that this term is related to Molecular property (data:2087), defined as "A report on the physical (e.g. structural) or chemical properties of molecules, or parts of a molecule". A cell barcode is a nucleic acid sequence identifying an individual cell, so Molecular property identifier does not seem to be the right choice.
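To illustrate how the three files relate, here is a minimal sketch. It assumes the usual Matrix Market convention of 1-based (row, column, value) entries, with rows pointing into the gene list and columns pointing into the barcode list. The barcodes, gene names, counts, and the `resolve` helper are all made up for illustration and are not taken from any HCA dataset or tool.

```python
# Hypothetical in-memory stand-ins for the three files described above.
barcodes = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"]  # cell barcodes file (made-up values)
features = ["GENE_A", "GENE_B"]                          # feature table, data:1270 (placeholder names)

# Count matrix, data:3917: each entry is (feature_row, barcode_column, count),
# with 1-based indices as in the Matrix Market coordinate format.
entries = [(1, 1, 3), (2, 2, 5)]


def resolve(entries, features, barcodes):
    """Replace the numeric indices of each entry with the gene and barcode they point to."""
    return [(features[row - 1], barcodes[col - 1], count) for row, col, count in entries]


for gene, barcode, count in resolve(entries, features, barcodes):
    print(gene, barcode, count)
```

The barcode list carries no measurements of its own; it only gives meaning to the column indices of the matrix, which is why it behaves as a list of identifiers rather than as a report on molecular properties.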

We've been looking at the ontologies and haven't found a term that we could use.

Do you have any term suggestions, or would it be better to create a new term? If a new term is needed, how long do you think it would take to release it?

paolaroncaglia commented 2 years ago

Hi @ipediez (cc @zoependlington and @pnejad ),

Thanks for providing informative details. EFO has EFO:0010198 'cell barcode' ("A short nucleotide sequence that is added during a single cell library preparation to identify reads from an individual cell."), along with other terms related to cell barcode that may be useful (see a list here). However, EFO:0010198 'cell barcode' isn't currently available to HCA because it's a descendant of 'single cell information', itself a child of 'information entity', a branch that isn't currently part of the EFO subset for HCA. I suspect that several terms in the branch that starts at 'single cell information' may be useful to HCA; I attach a screenshot below, and you may further explore the branch here. My suggestion would be to include the 'single cell information' branch in the EFO subset for HCA (if not the whole 'information entity' branch). However, I wonder if there are reasons why this wasn't done at the "onset" of the HCA application ontology. There may have been previous discussion before I joined the project.

@zoependlington , please comment if you're aware of any constraint or objection to that inclusion (other than ontology size); if not, would it be possible to modify the HCAO build pipeline in time for the next HCAO release (scheduled for Feb 16th)? That would make 'cell barcode' available to HCA wranglers ~Feb 18th.

@ipediez and @pnejad , please comment on any constraint from the HCA pipeline side - I'm not too clear on where you'd expect the 'cell barcode' term to be placed in the ontology. If I understand correctly, you've been using an EDAM term so far, but you mentioned that it may not be the right choice.

[Screenshot, 2022-02-02: the 'single cell information' branch]

zoependlington commented 2 years ago

@paolaroncaglia We can certainly extend the EFO slim that's released with the HCAO releases; I'm not aware of any reason why we couldn't!

ESapenaVentura commented 2 years ago

@zoependlington Same here; as I understand it, the HCAO slim serves exactly this purpose :)

Our only constraint is that we would need to update the metadata schema to allow for these terms to be validated.

As for your comment on the hierarchy, I'm not sure either. Maybe @pnejad has a better idea, but I don't see why we shouldn't include the whole "information entity" branch. It does seem to contain quite useful terms, even if they are of no use to us right now.

pnejad commented 2 years ago

@ESapenaVentura Do you mean updating the metadata schema to allow validation of EFO terms for content descriptions (it currently only allows EDAM terms)?

ESapenaVentura commented 2 years ago

Spot on @pnejad 💯 that is exactly what I meant

paolaroncaglia commented 2 years ago

Hi @ipediez @pnejad @ESapenaVentura cc @zoependlington @dosumis Summing up,

Please comment if you have any concern or additional suggestion. Otherwise, I'll update this ticket after the next HCAO release.

dosumis commented 2 years ago

We could add the whole 'single cell information' branch here:

https://github.com/HumanCellAtlas/metadata-schema/blob/ad20343b331abdde586037605ea091af38f2554e/json_schema/module/ontology/file_content_ontology.json#L30

This could work:

            "graph_restriction": {
                "ontologies": ["obo:edam", "obo:EFO"],
                "classes": ["data:0006", "EFO:0010185"],
                "relations": ["rdfs:subClassOf"],
                "direct": false,
                "include_self": false
            },
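For readers unfamiliar with this kind of restriction, here is a rough sketch of how its `classes`, `direct` and `include_self` fields could be interpreted. It is not the HCA validator. The `PARENTS` table is a toy hierarchy invented for illustration (it assumes EFO:0010185 is the root of the branch being added, places 'cell barcode' directly under it, and treats the EDAM terms as direct children of data:0006), `EFO:9999999` is a hypothetical term, and the `ontologies` and `relations` fields are ignored.

```python
# Toy child -> parents table. The real ontologies have more levels;
# these edges are illustrative assumptions only.
PARENTS = {
    "EFO:0010198": ["EFO:0010185"],  # 'cell barcode' under the assumed branch root
    "data:3917": ["data:0006"],      # count matrix
    "data:1270": ["data:0006"],      # feature table
}

RESTRICTION = {
    "classes": ["data:0006", "EFO:0010185"],
    "direct": False,
    "include_self": False,
}


def ancestors(term, parents):
    """Collect every class reachable from `term` by following subclass edges upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def satisfies(term, restriction, parents):
    """Return True if `term` sits under one of the restriction's root classes."""
    roots = set(restriction["classes"])
    if restriction["direct"]:
        reachable = set(parents.get(term, []))  # direct children only
    else:
        reachable = ancestors(term, parents)    # any depth
    if restriction["include_self"]:
        reachable = reachable | {term}          # the roots themselves become valid
    return bool(reachable & roots)


print(satisfies("EFO:0010198", RESTRICTION, PARENTS))
print(satisfies("EFO:0010185", RESTRICTION, PARENTS))
```

With `include_self` set to false, the root classes themselves are rejected and only their descendants pass, which matches the intent of offering wranglers specific terms rather than the generic branch roots.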

@zoependlington to add branch to HCAO. This needs to happen before the metadata schema is modified.

pnejad commented 2 years ago

@zoependlington Please go ahead and add single cell information branch to HCAO :)

zoependlington commented 2 years ago

Will do @pnejad! Thanks for confirming 😁

paolaroncaglia commented 2 years ago

Hi @ipediez @pnejad @ESapenaVentura,

The whole 'information entity' branch, including child 'single cell information' and descendant 'cell barcode', is now in the HCA application ontology. It can be browsed here.

As agreed, please now ensure that the HCA metadata schema is updated to allow validation of EFO ontology classes for content descriptions. Should this task be recorded somewhere else? If so, this ticket may be closed. Thank you.

Paola

paolaroncaglia commented 2 years ago

@pnejad I trust that you'll inform all HCA wranglers that this new branch of HCAO is available to them. We didn't mention it in the HCAO release notes, and I presume those are not consulted often. Thanks.

paolaroncaglia commented 2 years ago

@ipediez (cc @pnejad and @ESapenaVentura ) Based on a separate, more recent ticket, I trust that the HCA metadata schema now allows validation of terms in the 'information entity' branch, so I'll close this ticket. Thanks.