EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Suggestion to re-structure the "10x sequencing" library construction terms #1091

Closed anjaf closed 3 years ago

anjaf commented 3 years ago

Hi Paola,

As discussed in our last meeting, the ArrayExpress and Expression Atlas curation team is proposing some changes to the 10x library construction EFO branch.

1) As we discussed, using the term "sequencing" here is conceptually not quite right because these are "single cell library construction" methods and not directly related to how the resulting libraries are sequenced. Hence, we suggest to remove the "sequencing" substring from all term labels for "10x" http://www.ebi.ac.uk/efo/EFO_0008995 and all child terms.

2) Using the versioning of the different 10x gene expression as the main sub-categories causes some problems, in particular for the 5' technology, which is called version 1 but uses the same library strategy as the 10x 3' v2. Therefore, we suggest creating a separation between the 3 prime and 5 prime methods as the top layer instead of the versioning and have the versioned terms below these. The final branch will look something like this:

10x    
  10x 3' transcription profiling:  
    10x 3' v1
    10x 3' v2
    10x 3' v3
  10x 5' transcription profiling:  
    10x 5' v1
    10x 5' v2
  10x immune profiling:  
    10x Ig enrichment
    10x TCR enrichment

Action points for this change are:

| New term label | Parent term | Definition |
| --- | --- | --- |
| 10x 3' transcription profiling | "10x" http://www.ebi.ac.uk/efo/EFO_0008995 | 10x 3' transcription profiling is the 10x based single-cell technology that sequences mRNA molecules from their 3' end. |
| 10x 5' transcription profiling | "10x" http://www.ebi.ac.uk/efo/EFO_0008995 | 10x 5' transcription profiling is the 10x based single-cell technology that sequences mRNA molecules from their 5' end. |

| Term accession | New parent | New term label |
| --- | --- | --- |
| http://www.ebi.ac.uk/efo/EFO_0010713 | EFO_0008995 (10x) | 10x immune profiling |
| http://www.ebi.ac.uk/efo/EFO_0009901 | (new term for "10x 3' transcription profiling") | 10x 3' v1 |
| http://www.ebi.ac.uk/efo/EFO_0009899 | (new term for "10x 3' transcription profiling") | 10x 3' v2 |
| http://www.ebi.ac.uk/efo/EFO_0009922 | (new term for "10x 3' transcription profiling") | 10x 3' v3 |
| http://www.ebi.ac.uk/efo/EFO_0009900 | (new term for "10x 5' transcription profiling") | 10x 5' v2 |

Remove term: "10x 5' v3" http://www.ebi.ac.uk/efo/EFO_0009921 There is no version 3 of this protocol available from 10x (yet). The term might have been created with the "10x 3' v3" library construction method in mind, creating an equivalent 5' method. But this is actually already covered with "10x 5' v1" http://www.ebi.ac.uk/efo/EFO_0011025.

paolaroncaglia commented 3 years ago

Hi @anjaf (cc @ngeorgeebi and @sfexova),

In reply to "we suggest to remove the "sequencing" substring from all term labels for "10x" http://www.ebi.ac.uk/efo/EFO_0008995 and all child terms": I understand and agree with the rationale behind this. However, for the top-level term '10x sequencing', my concern is that the simple label '10x' is highly ambiguous. External curators or data scientists using a different pipeline than yours and/or relying on text matching methods would come across many spurious hits, and risk missing the right term. I'd suggest to rename '10x sequencing' as '10x single cell library construction' or '10x technology', or any wording that you may suggest and that's more informative than 10x to the broader user. :-) We can add as many synonyms as you deem useful. '10x' could stay as a broad or related synonym, but preferably not as an exact one. Would that still work for your pipeline?

Please let me know your thoughts and I can address the whole branch accordingly.

Thanks, Paola

paolaroncaglia commented 3 years ago

Notes for self:

The renaming and repositioning should not cause issues for HCA, because all terms involved would still be in the same branch.

Based on Anja's comment "Remove term: "10x 5' v3" http://www.ebi.ac.uk/efo/EFO_0009921 There is no version 3 of this protocol available from 10x (yet). The term might have been created with the "10x 3' v3" library construction method in mind, creating an equivalent 5' method. But this is actually already covered with "10x 5' v1" http://www.ebi.ac.uk/efo/EFO_0011025." =>

anjaf commented 3 years ago

> In reply to "we suggest to remove the "sequencing" substring from all term labels for "10x" http://www.ebi.ac.uk/efo/EFO_0008995 and all child terms". I understand and agree with the rationale behind this. However, for the top-level term '10x sequencing', my concern is that the simple label '10x' is highly ambiguous. External curators or data scientists using a different pipeline than yours and/or relying on text matching methods would come across many spurious hits, and risk missing the right term. I'd suggest to rename '10x sequencing' as '10x single cell library construction' or '10x technology', or any wording that you may suggest and that's more informative than 10x to the broader user. :-) We can add as many synonyms as you deem useful. '10x' could stay as a broad or related synonym, but preferably not as an exact one. Would that still work for your pipeline?

Hi @paolaroncaglia, we agree that "10x" is not specific enough as a main label. We can go with "10x technology" as the label, and also have "10x single cell library construction" and "10x Genomics" as synonyms to specify further.

paolaroncaglia commented 3 years ago

@anjaf FYI, I'll keep the current, "sequencing"-containing, labels (10x sequencing etc.) as related synonyms, for legacy purposes.
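To illustrate why the synonym scope matters for text matching, here is a minimal sketch. The `Term` class and its `matches` method are hypothetical and not part of EFO or any ontology library. Treating the two longer names as exact synonyms and '10x' as a broad one is an assumption, since the thread leaves the exact scopes open.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical term record with synonyms grouped by scope."""
    label: str
    exact: set = field(default_factory=set)
    broad: set = field(default_factory=set)
    related: set = field(default_factory=set)

    def matches(self, text, strict=True):
        # Strict matching uses only the label and exact synonyms.
        names = {self.label} | self.exact
        if not strict:
            # Loose matching also accepts broad and related synonyms.
            names |= self.broad | self.related
        return text.casefold() in {name.casefold() for name in names}


# Scopes below are assumptions made for illustration.
TENX = Term(
    label="10x technology",
    exact={"10x single cell library construction", "10x Genomics"},
    broad={"10x"},
    related={"10x sequencing"},
)
```

With this layout, a strict matcher does not map the ambiguous string "10x" to the term, while a looser matcher can still find it, along with the legacy "10x sequencing" label.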

paolaroncaglia commented 3 years ago

Hi @anjaf, I've implemented the changes you requested, except for removing (obsoleting/merging) "10x 5' v3" http://www.ebi.ac.uk/efo/EFO_0009921. However, there are a few terms for which I couldn't find explicit instructions (or I may have missed them). I'll attach a screenshot of the current view in Protege, could you please let me know if 10x v1, 10x v2 and 10x v3 are ok to stay or if they should also be obsoleted like 10x 5' v3? Thanks, Paola

[Screenshot: Screen Shot 2021-05-14 at 16 51 10]

paolaroncaglia commented 3 years ago

P.S. The new terms' IDs are EFO_0030003 (10x 3' transcription profiling) EFO_0030004 (10x 5' transcription profiling)
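For reference, the restructured branch can be written down as a child-to-parent map. This is only an illustrative sketch: the IDs and labels are the ones quoted in this ticket, the two enrichment terms are left out because their IDs are not given here, and the helper name `ancestors` is hypothetical.

```python
# Child -> parent map for the restructured 10x branch.
# IDs are those quoted in this ticket; placement follows the proposed tree.
PARENT = {
    "EFO_0030003": "EFO_0008995",  # 10x 3' transcription profiling -> 10x
    "EFO_0030004": "EFO_0008995",  # 10x 5' transcription profiling -> 10x
    "EFO_0010713": "EFO_0008995",  # 10x immune profiling -> 10x
    "EFO_0009901": "EFO_0030003",  # 10x 3' v1
    "EFO_0009899": "EFO_0030003",  # 10x 3' v2
    "EFO_0009922": "EFO_0030003",  # 10x 3' v3
    "EFO_0011025": "EFO_0030004",  # 10x 5' v1
    "EFO_0009900": "EFO_0030004",  # 10x 5' v2
}


def ancestors(term_id):
    """Return the chain of parents from the given term up to the branch root."""
    chain = []
    while term_id in PARENT:
        term_id = PARENT[term_id]
        chain.append(term_id)
    return chain
```

A check such as `"EFO_0008995" in ancestors(term_id)` then tells a pipeline whether a term still sits in the 10x branch after the repositioning.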

anjaf commented 3 years ago

Good point. In my opinion, the "10x v1, 10x v2 and 10x v3" terms can be obsoleted too, as they technically refer to the 3' versions of the protocol (from the time when we didn't yet have the distinction between 3' and 5'). They could perhaps be kept as synonyms of the corresponding 3' terms, because they used to be synonyms of those terms.

paolaroncaglia commented 3 years ago

Thanks Anja. The changes in the screenshot above relative to the 10x branch will be visible in the next EFO release scheduled for Monday May 17th. But the obsoletions will need to wait for the following (June) EFO release. I trust that won't be a problem as the correct structure is now in place so you may use the new/newly positioned terms instead. Paola

paolaroncaglia commented 3 years ago

Notes for self: left to do for this ticket:

mshadbolt commented 3 years ago

These changes look really positive and address a lot of the issues I had noticed but not progressed to a solution, so I'm super happy to see the changes progressing.

One other thing that could be added to the 10x branch would be feature barcoding that we are starting to see done in combination with immune profiling/VDJ libraries where antibody tagging is used to do a proteogenomics assay at the same time as gene expression and TCR/BCR libraries.

One other thing I have been pondering lately is that the gene expression done as part of immune profiling/VDJ is usually VDJ v1 or v1.1, but it is generally the same as 10x v2 5', which makes it a little confusing.

Also, we are now seeing things like v3.1 etc. Do we need extra terms for these, or are they close enough to the v3 version? I haven't looked deeply into this.

Anyways apologies for the brain dump, I just saw this ticket and thought I'd put in my two cents before leaving...

paolaroncaglia commented 3 years ago

Hi @mshadbolt , Thanks for your feedback. I'll leave it to @anjaf to comment on the confusing "version" issues, and to let me know if further ontology edits are suggested. Meanwhile, FYI Anja requested feature barcoding terms in https://github.com/EBISPOT/efo/issues/1092 which I haven't addressed yet, so Marion please feel free to comment there if you have any feedback, your suggestions are always welcome. I'm sorry that you're leaving. Thank you! Best, Paola

anjaf commented 3 years ago

Thanks for checking this @mshadbolt! Yes, the feature barcoding will be added too.

Regarding the VDJ + gene expression combinatorial experiments, indeed, the individual libraries will have to be annotated separately with the corresponding terms. So this shouldn't be a problem.

Regarding the different versions, I have understood it the following way: The 10x 5prime v1 chemistry uses the same library configuration as the 10x 3prime v2 chemistry (e.g. with cDNA in read2) and differs substantially from the 10x 3prime v1 chemistry. With the "10x V2 5'" term in EFO, we actually mean "10x 5' v1", the first version of the 5 prime protocol, whose library configuration happens to match that of "10x 3' v2". We currently have the 10x version imply the library configuration and are sorting the protocols into these categories. With the introduction of the 5 prime protocols this started to cause problems. Going by the library layout and calling it "10x 5' v2" is confusing because the protocol people have used clearly says v1 or v1.1. Thus, it makes more sense to first distinguish between the 3' and 5' protocols, because this is how 10x developed and versioned them. Note that I only left the "10x 5' v2" term in the list because I noticed that this is already available as a new product (supporting dual indexing), but I haven't seen it submitted yet.

For simplicity, I would stay away from adding more terms for sub-versions like v1.1 if there is no change in the library layout or in the way the data needs to be analysed. If there were such a change, a new term would be in order.
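The sub-version rule above could be applied during annotation roughly as follows. This is a sketch under the stated assumption that a sub-version such as v1.1 does not change the library layout or the analysis; the function name `term_label` and its signature are made up for illustration.

```python
import re


def term_label(prime_end, version):
    """Map e.g. (5, "v1.1") to "10x 5' v1" by dropping the sub-version."""
    if prime_end not in (3, 5):
        raise ValueError("prime_end must be 3 or 5")
    # Accept "v1", "V1.1", "3.1", ...; keep only the major version number.
    match = re.fullmatch(r"v?(\d+)(?:\.\d+)*", version.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return f"10x {prime_end}' v{match.group(1)}"
```

For example, `term_label(5, "v1.1")` gives "10x 5' v1". A sub-version that does change the library layout would need its own term and an explicit exception to this mapping.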

PS: Good luck and all the best for your future, Marion! :)

rays22 commented 3 years ago
> As we discussed, using the term "sequencing" here is conceptually not quite right because these are "single cell library construction" methods and not directly related to how the resulting libraries are sequenced. Hence, we suggest to remove the "sequencing" substring from all term labels for "10x" http://www.ebi.ac.uk/efo/EFO_0008995 and all child terms.

I would like to confirm that the proposed term label changes make sense to me. Recently it has come to my attention that some of the HCA components rely on the term labels computationally and not on the machine readable identifiers. Term labels (current at the time of data ingestion) are stored in the HCA metadata together with the machine readable ontology IDs. If term labels change, then there will be some discrepancy between the labels in older datasets compared with the newer labels in the datasets from the period after the term label updates. I think using the more stable machine readable term identifiers can mitigate all the inconveniences caused by term label changes, but I cannot comment on the software development costs of switching from using term labels computationally to using the stable identifiers. I will try to follow that up with my HCA colleagues.

paolaroncaglia commented 3 years ago

Hi @rays22 ,

> I would like to confirm that the proposed term label changes make sense to me.

Thanks for confirming.

> Recently it has come to my attention that some of the HCA components rely on the term labels computationally and not on the machine readable identifiers. Term labels (current at the time of data ingestion) are stored in the HCA metadata together with the machine readable ontology IDs. If term labels change, then there will be some discrepancy between the labels in older datasets compared with the newer labels in the datasets from the period after the term label updates. I think using the more stable machine readable term identifiers can mitigate all the inconveniences caused by term label changes, but I cannot comment on the software development costs of switching from using term labels computationally to using the stable identifiers. I will try to follow that up with my HCA colleagues.

Yes, please do follow up with your colleagues (I do understand that several people may be on leave this week). This is a very important issue, and hopefully HCA components may be switched to fully rely on terms' stable IDs rather than labels. While we may have some control over keeping EFO labels as they are if requested by HCA, we don't have that control over terms in other ontologies (CL, Uberon, Mondo, EDAM...).

Labels may change for a number of good reasons (e.g. to make them consistent with other labels or more meaningful/less ambiguous, or because there's a typo). Editors don't modify labels unless the change is considered beneficial, so name changes don't happen very frequently, but they do happen, and it's very important that pipelines can accommodate this. In contrast, term IDs don't change and don't get deleted (in most reliable ontologies). Terms may become obsoleted or be merged with others, but their IDs are never recycled, and the obsoletion/merge process is usually well documented to enable pipelines to refer to the more correct choices. So terms' IDs are stable and reliable and should be the ideal choice for referring to terms in pipelines. I made a note to discuss this at our next monthly meeting if the issue isn't addressed before then.
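As a rough sketch of how a pipeline keyed on IDs might cope with obsoleted or merged terms: it can follow the documented replacement links until it reaches a current term. The function name `resolve`, the shape of the `replaced_by` mapping, and the IDs in the example are all made up for illustration. They are not real EFO terms and not an existing API.

```python
def resolve(term_id, replaced_by, max_hops=10):
    """Follow replaced-by links until reaching a term with no replacement."""
    seen = {term_id}
    for _ in range(max_hops):
        successor = replaced_by.get(term_id)
        if successor is None:
            # No replacement recorded: this ID is the current one.
            return term_id
        if successor in seen:
            raise ValueError(f"replacement cycle at {successor}")
        seen.add(successor)
        term_id = successor
    raise ValueError("too many replacement hops")


# Made-up IDs for illustration only; not real ontology terms.
REPLACED_BY = {"EX:0000001": "EX:0000002", "EX:0000002": "EX:0000003"}
```

Because the stored value is the ID, a label change needs no migration at all, and an obsoletion only needs a new entry in the replacement mapping.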

Thanks, Paola

paolaroncaglia commented 3 years ago

Update re. changing term labels and implications for HCA (see previous 2 comments in this thread): At the curators/wranglers meeting on June 9th, we resolved for @rays22 "to suggest to data portal devs to use OLS API lookup to correct labels".
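A label-correcting lookup of the kind suggested here might look like the sketch below. The endpoint URL and the `_embedded` / `terms` / `label` response layout are assumptions about the OLS REST API and should be checked against its documentation. The names `fetch_term` and `correct_label`, and the stored record shape, are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint layout; verify against the OLS API documentation.
OLS_TERMS_URL = "https://www.ebi.ac.uk/ols4/api/ontologies/efo/terms"


def fetch_term(iri):
    """Fetch the OLS record for one term IRI (performs a network request)."""
    with urlopen(f"{OLS_TERMS_URL}?{urlencode({'iri': iri})}") as response:
        payload = json.load(response)
    # Assumed response shape: the matching term is the first embedded entry.
    return payload["_embedded"]["terms"][0]


def correct_label(record, fetch=fetch_term):
    """Return a copy of a stored {'iri', 'label'} record with the current label."""
    current = fetch(record["iri"])
    return {**record, "label": current["label"]}
```

Passing the fetcher in as a parameter keeps the correction step testable without network access, for example with a stub that returns a fixed label for a given IRI.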

paolaroncaglia commented 3 years ago

Hi @anjaf (cc @ngeorgeebi, @sfexova and @rays22 ), The remaining edits in this ticket have been done, and will be public after tomorrow's EFO release. Thanks.