Closed paolaroncaglia closed 2 years ago
Note for self: I've added this ticket to the agenda for the next monthly single-cell curators call.
'Single cell sequencing' could encompass many technologies. Wikipedia has scATAC-seq and Single-cell DNA methylome sequencing as well as scRNAseq. So, we could make 'single cell sequencing' a grouping term, but then it would need to be moved, as it is currently under 'RNA assay'.
At curators meeting today, @dosumis suggested to revisit this by involving others - of interest for CAP too, so I'll tag Bradley too.
Silvie said that SCA isn't using those 2 EFO terms at the moment, so changes shouldn't affect them. But note, no one from ArrayExpress/Gene Expression Atlas attended the meeting or commented on this ticket, so we still don't know if changes could cause issues for them.
@bvarner-ebi
@dosumis I can't co-assign Bradley on EFO tickets
@bvarner-ebi should now have relevant permissions.
Hi @sfexova @ngeorgeebi @anjaf, on the topic of single cell sequencing vs single cell RNA sequencing, could one of you please confirm whether the strategy suggested by @bvarner-ebi here would work well for ArrayExpress/Gene Expression Atlas? We know that the Single Cell Expression Atlas isn't using those terms at the moment, but we'd like to make sure this works for all parties involved. If Anja is no longer the point of contact, could you please point us to others? Thank you. Paola and Bradley
Sorry for the confusion. You can go ahead with this. There are no issues from ArrayExpress/GXA side either.
Thank you @anjaf !
Thank you, all, for the feedback and reviews!
@bvarner-ebi Re-opening this issue as, based on the very latest release of EFO, the 'single cell sequencing' branch could be cleaned up further. In particular
And looking at the inferred view, looks like two more terms should be subclasses of 'single-cell RNA sequencing' please
This is not urgent as it will go into the next EFO release on April 19th. Let me know if you'd prefer me to do the edits. Thank you.
@paolaroncaglia, I will clean this up before the next release. Thank you for taking a closer look.
@bvarner-ebi thank you for taking care of these edits.
Sorry to jump in here again. ArrayExpress is using the terms "RNA-seq of coding RNA from single cells" and "RNA-seq of non coding RNA from single cells", and those term labels should not be changed if possible. I also disagree with hyphenating as the "norm" for EFO terms. Most (if not all) occurrences of "single cell" in EFO are written with a space, not a hyphen.
Thanks for the feedback, @anjaf. Do you have any objections to reorganising with respect to the subclasses?
For 'single cell sequencing', I consciously did not hyphenate it in this recent round of edits. Wikipedia does not hyphenate it, and it is ambiguous (to me) if single cell is serving as a compound adjective here... I read it as 'single' describing the cell as opposed to 'single cell' describing the sequencing.
However, I do think single-cell should be hyphenated in 'single-cell RNA sequencing' since I read it as a compound adjective... I read it as single-cell describing the RNA. Also, Wikipedia does hyphenate it.
For both terms, I see a mix of hyphenated and not hyphenated with a cursory web search. Arguments could be made for both formats.
@paolaroncaglia, do you have any strong inclinations for one way or the other?
The hierarchical arrangement is fine. That doesn't impact us, as long as it is under the "RNA assay" branch.
Regarding the hyphen, I agree that grammatically the compound phrases should be hyphenated to make the meaning unambiguous. In that sense, the "non coding" in "RNA-seq of non coding RNA from single cells" should probably also be hyphenated by English grammar rules. But I think it goes against the general EFO style, which predominantly uses spaces. Hence, I would suggest keeping everything "single cell" space-separated for consistency.
@anjaf @bvarner-ebi (cc @dosumis for the discussion on labels vs. IDs) I lean towards hyphens every time a two-word string is used as a compound term, and I like consistency in an ontology. But I've seen native speakers and writers forgo hyphens, and as long as the meaning is unambiguous, I'm fine with either. What should really concern us here is Anja's comment that "ArrayExpress is using the "RNA-seq of coding RNA from single cells" and "RNA-seq of non coding RNA from single cells" and those term labels should not be changed if possible". Do AE's tools and pipelines rely on ontology labels rather than IDs? That is not ideal. We've had that discussion previously with the Single Cell Expression Atlas group and I now suspect that we never reached a resolution... I'll link the previous ticket here when I find it.
Here are two tickets partly related to the discussion: https://github.com/EBISPOT/efo/issues/934 https://github.com/EBISPOT/efo/issues/959
And here's a link to the previous discussion: https://github.com/obophenotype/cell-ontology/issues/792#issuecomment-763710300
Relevant exchange between me and Nancy from the SCEA: "To avoid blocks, I can assume that the SCA pipeline issue will be addressed and that any necessary change in term labels will be acceptable." "yes, please go ahead and we will deal with any issues further down the line." @anjaf please let us know if this is true for AE too or not, thank you. It's difficult to commit to never change ontology labels, especially if we don't have a clear and easy way of tagging EFO terms used by AE. I think we have a list somewhere but I'm not sure it's maintained.
Hi @paolaroncaglia ,
Do AE's tools and pipelines rely on ontology labels rather than IDs? That is not ideal.
Unfortunately, yes. We use the term labels in our metadata files without the ontology ID for the ArrayExpress experiment type. I agree this is less than ideal. But ArrayExpress is quite an old legacy system and we have no funding to make such changes at the moment. If the term labels change, I'm not sure if we can make this backwards compatible. We haven't had that situation yet. It would break a few things like experiment type searches with ontology expansion in ArrayExpress: https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RNA-seq+of+coding+RNA+from+single+cells%22
The full list of experiment type terms we are using in Annotare/ArrayExpress is here: https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html
We've had that discussion previously with the Single Cell Expression Atlas group and I now suspect that we never reached a resolution...
I'm afraid these are independent of each other, ArrayExpress curation and pipelines are not directly connected with the Single Cell Expression Atlas work.
... should probably be hyphenated by English grammar rules. But I think it goes against general EFO style that is using predominantly spaces. Hence, I would be suggesting to keep everything "single cell" space-separated for consistency.
@anjaf, thank you for the additional feedback. Is there a style guide or other reference for EFO that details the "general EFO style"? This type of reference would be helpful for editors. I did a quick check and do not see this particular issue addressed in OBO Foundry naming conventions. It is mentioned to use spaces to separate words, but one may argue that a compound word is one word.
@zoependlington, @dosumis, do you have any input on hyphenating compound words in labels vs separating the words with spaces? E.g., 'single-cell RNA sequencing' vs 'single cell RNA sequencing'
@bvarner-ebi We tend to follow OBO naming conventions as far as possible, so there isn't any specific EFO rule regarding this. Typically, I think we go for spaces, but there are a few terms in EFO with a hyphen (e.g. EFO_0030053). I think as long as there is a non-hyphenated version as a synonym then all should be well. Granted, @paolaroncaglia has worked more on the sequencing branch than I have so she may have a different opinion.
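To illustrate the point about keeping a non-hyphenated synonym, here is a minimal sketch of a hyphen-insensitive lookup. The `normalise` and `matches_term` helpers are hypothetical names made up for this example and are not part of EFO tooling or any pipeline mentioned in this ticket; only the label and synonym strings come from the discussion.

```python
def normalise(text):
    """Treat hyphens as spaces, collapse whitespace, and ignore letter case."""
    return " ".join(text.replace("-", " ").split()).casefold()


def matches_term(query, label, synonyms=()):
    """Return True if the query matches the label or any synonym after normalisation."""
    candidates = [label, *synonyms]
    return any(normalise(query) == normalise(candidate) for candidate in candidates)


# Label and exact synonym as proposed in this ticket.
label = "single-cell RNA sequencing"
synonyms = ["single cell RNA sequencing"]

hyphenated_hit = matches_term("single-cell RNA sequencing", label, synonyms)
spaced_hit = matches_term("single cell RNA sequencing", label, synonyms)
unrelated_hit = matches_term("single nucleus RNA sequencing", label, synonyms)
```

A consumer that matches strings exactly, without any normalisation, would still find the spaced form through the exact synonym, which is why adding the synonym is the safer option.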
@anjaf (and @bvarner-ebi , see action item below please) hi, thank you for your feedback:
Do AE's tools and pipelines rely on ontology labels rather than IDs? That is not ideal.
Unfortunately, yes. We use the term labels in our metadata files without the ontology ID for the ArrayExpress experiment type. I agree this is less than ideal. But ArrayExpress is quite an old legacy system and we have no funding to make such changes at the moment. If the term labels change, I'm not sure if we can make this backwards compatible. We haven't had that situation yet. It would break a few things like experiment type searches with ontology expansion in ArrayExpress: https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RNA-seq+of+coding+RNA+from+single+cells%22
The full list of experiment type terms we are using in Annotare/ArrayExpress is here: https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html
Ok, so we can assume that, for any term not in that list, it is ok to change the label if needed. Note, we tend to only change labels if they're incorrect, or to ensure consistency when reasonable. @bvarner-ebi , as this ticket is now assigned to you, could you please
@anjaf, also, are there EFO terms, other than the ones in the experiment type list, that ArrayExpress refers to by label rather than ID? Zoë and I kept a longer list of Annotare terms, but I don't think I can access that doc anymore, and it'd be good to have an updated/confirmed list please, for any potential EFO editor. Thank you.
@zoependlington @bvarner-ebi
@bvarner-ebi We tend to follow OBO naming conventions as far as possible, so there isn't any specific EFO rule regarding this. Typically, I think we go for spaces, but there are a few terms in EFO with a hyphen (e.g. EFO_0030053). I think as long as there is a non-hyphenated version as a synonym then all should be well. Granted, @paolaroncaglia has worked more on the sequencing branch than I have so she may have a different opinion.
No objection there, thank you.
Thanks for double-checking the "assay branch" for us! That is the only critical bit where we rely on the labels. The other terms are less critical and won't break any pipelines if the label changes. For those, Annotare uses the ontology IDs, rather than labels, so we'd only have an issue if terms get deprecated. But even then, it's fine as long as we're kept in the loop and can make the changes in the pipeline. I also remember that we had an "Annotare EFO check list", but I'm also not sure what happened to that. This file from Annotare's code contains the list of EFO accessions that are currently hard-coded on our end: https://github.com/arrayexpress/annotare2/blob/master/app/webapp/src/main/resources/Annotare-default.properties
I think for anything Expression Atlas curation related, label changes are fine too. These will get updated via the Zooma mappings.
[ ] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case? Thanks in advance.
and
This file from Annotare's code contains the list of EFO accessions that are currently hard-coded on our end: https://github.com/arrayexpress/annotare2/blob/master/app/webapp/src/main/resources/Annotare-default.properties
For clarity, should I be checking both lists noted above for changed labels?
@bvarner-ebi
[ ] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case? Thanks in advance.
and
This file from Annotare's code contains the list of EFO accessions that are currently hard-coded on our end: https://github.com/arrayexpress/annotare2/blob/master/app/webapp/src/main/resources/Annotare-default.properties
For clarity, should I be checking both lists noted above for changed labels?
No, just the top one (https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html). Thanks for checking.
@paolaroncaglia, @anjaf,
[x] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case
I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?
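For anyone repeating this kind of check, a minimal sketch of the comparison is below. It assumes the two label sets have already been collected by hand into dictionaries keyed by term ID; the `compare_labels` function and the dictionary names are hypothetical and only illustrate the idea. The one entry shown is the difference reported above.

```python
def compare_labels(expected, current):
    """Classify each term ID as 'same', 'case_only', 'changed' or 'missing'."""
    report = {}
    for term_id, expected_label in expected.items():
        current_label = current.get(term_id)
        if current_label is None:
            report[term_id] = "missing"
        elif current_label == expected_label:
            report[term_id] = "same"
        elif current_label.casefold() == expected_label.casefold():
            # Labels differ only by letter case, e.g. 'c' vs 'C'.
            report[term_id] = "case_only"
        else:
            report[term_id] = "changed"
    return report


# Label as shown on the Annotare experiment types page (from this ticket).
annotare_labels = {"EFO_0005517": "RIP-chip by array"}
# Label as found in EFO (from this ticket).
efo_labels = {"EFO_0005517": "RIP-Chip by array"}

report = compare_labels(annotare_labels, efo_labels)
```

Separating 'case_only' from 'changed' makes it easy to ask the downstream group whether letter case matters for them before treating the difference as a break.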
I also noted inconsistency with the use of hyphens vs spaces in the labels, e.g., 'non coding' vs 'high-throughput'. 'single nucleus RNA sequencing' has a space between single and nucleus.
Do we have consensus on whether or not to keep the label as 'single-cell RNA sequencing' (with the hyphen)? It does not appear on the Annotare list.
So, is the current plan as follows?
[ ] Keep 'single-cell RNA sequencing' as the label and add 'single cell RNA sequencing' as an exact synonym. Do not change hyphens to spaces or vice versa on any terms for now to avoid unintended downstream effects in ArrayExpress.
[ ] Set the following to be subclasses of 'single-cell RNA sequencing': 'full length single cell RNA sequencing' 'RNA-seq of coding RNA from single cells' 'RNA-seq of non coding RNA from single cells' 'tag based single cell RNA sequencing'
@bvarner-ebi @anjaf
[x] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?
@anjaf please?
I also noted inconsistency with the use of hyphens vs spaces in the labels, e.g., 'non coding' vs 'high-throughput'. 'single nucleus RNA sequencing' has a space between single and nucleus.
Yes, a typical issue when an ontology is edited by multiple pairs of hands and when the literature/common usage themselves do not use consistent wording. That's acceptable. I'd leave everything as is, it'd be hard to reach a wide consensus.
Do we have consensus on whether or not to keep the label as 'single-cell RNA sequencing' (with the hyphen)? It does not appear on the Annotare list.
In this case, I'd leave 'single-cell RNA sequencing' (it doesn't damage any pipeline and you don't have to un-do the edit).
So, is the current plan as follows? [ ] Keep 'single-cell RNA sequencing' as the label and add 'single cell RNA sequencing' as an exact synonym. Do not change hyphens to spaces or vice versa on any terms for now to avoid unintended downstream effects in ArrayExpress.
I agree
[ ] Set the following to be subclasses of 'single-cell RNA sequencing': 'full length single cell RNA sequencing' 'RNA-seq of coding RNA from single cells' 'RNA-seq of non coding RNA from single cells' 'single-cell RNA sequencing' 'tag based single cell RNA sequencing'
I agree. Anja confirmed that the above rearrangement wouldn't cause issues for them, as all terms would still be descendants of RNA assay. Thank you.
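The "still descendants of RNA assay" condition can be sanity-checked with a simple walk up the parent links. The sketch below uses a toy, hand-written parent map, not the actual EFO hierarchy: placing 'single-cell RNA sequencing' directly under 'RNA assay' is an assumption made only to keep the example short, and `is_descendant` is a hypothetical helper.

```python
def is_descendant(term, ancestor, parents):
    """Return True if `ancestor` is reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent == ancestor:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


# Toy parent map (child -> list of parents); intermediate classes are assumed.
parents = {
    "full length single cell RNA sequencing": ["single-cell RNA sequencing"],
    "RNA-seq of coding RNA from single cells": ["single-cell RNA sequencing"],
    "RNA-seq of non coding RNA from single cells": ["single-cell RNA sequencing"],
    "tag based single cell RNA sequencing": ["single-cell RNA sequencing"],
    "single-cell RNA sequencing": ["RNA assay"],
}

moved_terms = [
    "full length single cell RNA sequencing",
    "RNA-seq of coding RNA from single cells",
    "RNA-seq of non coding RNA from single cells",
    "tag based single cell RNA sequencing",
]
all_under_rna_assay = all(is_descendant(t, "RNA assay", parents) for t in moved_terms)
```

In practice the parent map would need to come from the reasoned (inferred) version of the ontology rather than from a hand-written dictionary.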
I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?
Well spotted! We're indeed using the lower case version. Probably different casing doesn't cause issues in our pipeline. https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RIP-Chip+by+array%22+ Was this always the case? Then the difference didn't have any impact on our pipelines.
I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?
Well spotted! We're indeed using the lower case version. Probably different casing doesn't cause issues in our pipeline. https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RIP-Chip+by+array%22+ Was this always the case? Then the difference didn't have any impact on our pipelines.
It seems like it was written with 'C' at least since Oct 2021: https://github.com/EBISPOT/efo/issues/1296
I could not find the original term request.
@anjaf, @paolaroncaglia, thank you for all the feedback on this ticket. The changes made are detailed in PR #1483.
@bvarner-ebi @anjaf
I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?
Well spotted! We're indeed using the lower case version. Probably different casing doesn't cause issues in our pipeline. https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RIP-Chip+by+array%22+ Was this always the case? Then the difference didn't have any impact on our pipelines.
It seems like it was written with 'C' at least since Oct 2021: #1296
I could not find the original term request.
EFO trivia: the term was created by Drashtti, who left EBI in 2015. It's possible that the label has stayed that way since then.
@anjaf @sfexova @ngeorgeebi @pnejad Cc @zoependlington @dosumis
There are 2 terms in EFO
EFO:0007832 'single cell sequencing'
EFO:0008913 scRNA-seq
Both are descendants of ‘RNA assay’, and their children assay RNA too. The two terms seem to mean the same thing, i.e. single-cell RNA sequencing. I suggest to
Do you also think that the two terms mean the same? If they were merged, would that create any issues for ArrayExpress/GXA/SCA/HCA?
If concepts referring to non-RNA sequencing from single cells are needed in the future, they can be created on an ad-hoc basis.
This would also address @LTLA’s request here https://github.com/EBISPOT/efo/issues/1034. Thank you.
(Note for self, this ticket is an update of https://github.com/EBISPOT/efo/issues/887.)