EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/
Merge EFO:0008913 scRNA-seq into EFO:0007832 'single cell sequencing'? #1446

paolaroncaglia closed 2 years ago

paolaroncaglia commented 2 years ago

@anjaf @sfexova @ngeorgeebi @pnejad Cc @zoependlington @dosumis

There are 2 terms in EFO

EFO:0007832 'single cell sequencing'

EFO:0008913 scRNA-seq

Both are descendants of ‘RNA assay’, and their children assay RNA too. The two terms seem to mean the same thing, i.e. single-cell RNA sequencing. I suggest to

Do you also think that the two terms mean the same? If they were merged, would that create any issue to ArrayExpress/GXA/SCA/HCA?

If concepts referring to non-RNA sequencing from single cells are needed in the future, they can be created on an ad-hoc basis.

This would also address @LTLA’s request here https://github.com/EBISPOT/efo/issues/1034. Thank you.

(Note for self, this ticket is an update of https://github.com/EBISPOT/efo/issues/887.)

paolaroncaglia commented 2 years ago

Note for self: I've added this ticket to the agenda for the next monthly single-cell curators call.

dosumis commented 2 years ago

'Single cell sequencing' could encompass many technologies. Wikipedia has scATAC-seq and Single-cell DNA methylome sequencing as well as scRNAseq. So, we could make 'single cell sequencing' a grouping term, but then 'single cell sequencing' would need to be moved, as it is currently under RNA assay.

paolaroncaglia commented 2 years ago

At the curators meeting today, @dosumis suggested revisiting this by involving others. It's of interest for CAP too, so I'll tag Bradley as well.

Silvie said that SCA isn't using those 2 EFO terms at the moment, so changes shouldn't affect them. But note, no one from ArrayExpress/Gene Expression Atlas attended the meeting or commented on this ticket, so we still don't know if changes could cause issues for them.

paolaroncaglia commented 2 years ago

@bvarner-ebi

paolaroncaglia commented 2 years ago

@dosumis I can't co-assign Bradley on EFO tickets

dosumis commented 2 years ago

@bvarner-ebi should now have relevant permissions.

paolaroncaglia commented 2 years ago

Hi @sfexova @ngeorgeebi @anjaf, on the topic of single cell sequencing vs single cell RNA sequencing, could one of you please confirm if the strategy suggested by @bvarner-ebi here would work well for ArrayExpress/Gene Expression Atlas? We know that the Single Cell Expression Atlas isn't using those terms at the moment, but we'd like to make sure for all parties involved. If Anja is no longer the point of contact, could you please point us to others? Thank you. Paola and Bradley

anjaf commented 2 years ago

Sorry for the confusion. You can go ahead with this. There are no issues from ArrayExpress/GXA side either.

paolaroncaglia commented 2 years ago

Thank you @anjaf !

ghost commented 2 years ago

Thank you, all, for the feedback and reviews!

paolaroncaglia commented 2 years ago

@bvarner-ebi Re-opening this issue as, based on the very latest release of EFO, the 'single cell sequencing' branch could be cleaned up further. In particular

[Screenshot 2022-03-15 at 12:33:00]

And looking at the inferred view, it looks like two more terms should be subclasses of 'single-cell RNA sequencing', please

[Screenshot 2022-03-15 at 12:36:35]

This is not urgent as it will go into the next EFO release on April 19th. Let me know if you'd prefer me to do the edits. Thank you.

ghost commented 2 years ago

@paolaroncaglia, I will clean this up before the next release. Thank you for taking a closer look.

paolaroncaglia commented 2 years ago

@bvarner-ebi thank you for taking care of these edits.

anjaf commented 2 years ago

Sorry to jump in here again. ArrayExpress is using the "RNA-seq of coding RNA from single cells" and "RNA-seq of non coding RNA from single cells" and those term labels should not be changed if possible. I also disagree with hyphenating as the "norm" for EFO terms. Most (if not all) occurrences of "single cell" in EFO are with space not hyphen.

ghost commented 2 years ago

Thanks for the feedback, @anjaf. Do you have any objections to reorganising with respect to the subclasses?

For 'single cell sequencing', I consciously did not hyphenate it in this recent round of edits. Wikipedia does not hyphenate it, and it is ambiguous (to me) if single cell is serving as a compound adjective here... I read it as 'single' describing the cell as opposed to 'single cell' describing the sequencing.

However, I do think single-cell should be hyphenated in 'single-cell RNA sequencing' since I read it as a compound adjective... I read it as single-cell describing the RNA. Also, Wikipedia does hyphenate it.

For both terms, I see a mix of hyphenated and not hyphenated with a cursory web search. Arguments could be made for both formats.

@paolaroncaglia, do you have any strong inclinations for one way or the other?

anjaf commented 2 years ago

The hierarchical arrangement is fine. That doesn't impact us, as long as it is under the "RNA assay" branch.

Regarding the hyphen, I agree that grammatically the compound phrases should be hyphenated to make the meaning unambiguous. In that sense also the "non coding" in "RNA-seq of non coding RNA from single cells" should probably be hyphenated by English grammar rules. But I think it goes against general EFO style that is using predominantly spaces. Hence, I would be suggesting to keep everything "single cell" space-separated for consistency.

paolaroncaglia commented 2 years ago

@anjaf @bvarner-ebi cc @dosumis for the discussion on labels vs. IDs

I lean towards hyphens every time a two-word string is used as a compound term, and I like consistency in an ontology. But I've seen native speakers and writers forgo hyphens, and as long as the meaning is unambiguous, I'm fine with either.

What should really concern us here is Anja's comment that "ArrayExpress is using the "RNA-seq of coding RNA from single cells" and "RNA-seq of non coding RNA from single cells" and those term labels should not be changed if possible". Do AE's tools and pipelines rely on ontology labels rather than IDs? That is not ideal. We've had that discussion previously with the Single Cell Expression Atlas group and I now suspect that we never reached a resolution... I'll link the previous ticket here when I find it.

paolaroncaglia commented 2 years ago

Here are two tickets partly related to the discussion: https://github.com/EBISPOT/efo/issues/934 https://github.com/EBISPOT/efo/issues/959

paolaroncaglia commented 2 years ago

And here's a link to the previous discussion: https://github.com/obophenotype/cell-ontology/issues/792#issuecomment-763710300

paolaroncaglia commented 2 years ago

Relevant exchange between me and Nancy from the SCEA:

"To avoid blocks, I can assume that the SCA pipeline issue will be addressed and that any necessary change in term labels will be acceptable."

"yes, please go ahead and we will deal with any issues further down the line."

@anjaf please let us know if this is true for AE too or not, thank you. It's difficult to commit to never changing ontology labels, especially if we don't have a clear and easy way of tagging EFO terms used by AE. I think we have a list somewhere but I'm not sure it's maintained.

anjaf commented 2 years ago

Hi @paolaroncaglia ,

Do AE's tools and pipelines rely on ontology labels rather than IDs? That is not ideal.

Unfortunately, yes. We use the term labels in our metadata files without the ontology ID for the ArrayExpress experiment type. I agree this is less than ideal. But ArrayExpress is quite an old legacy system and we have no funding to make such changes at the moment. If the term labels change, I'm not sure if we can make this backwards compatible. We haven't had that situation yet. It would break a few things like experiment type searches with ontology expansion in ArrayExpress: https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RNA-seq+of+coding+RNA+from+single+cells%22
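
To illustrate the point above about labels vs IDs, here is a minimal sketch. It is not ArrayExpress code: the ID `TERM_A`, the relabelled string, and the helper `resolve_by_label` are all hypothetical and exist only to show why metadata that stores a label stops resolving after a label change, while metadata that stores an ID does not.

```python
def resolve_by_label(label, ontology):
    """Return the ID of the term whose current label equals `label`, or None."""
    for term_id, current_label in ontology.items():
        if current_label == label:
            return term_id
    return None


# Two hypothetical ontology releases, as {term ID: label}. "TERM_A" is a placeholder ID.
old_release = {"TERM_A": "RNA-seq of coding RNA from single cells"}
new_release = {"TERM_A": "RNA-seq of coding RNA from single-cells"}  # hypothetical relabel

# What a metadata file might store: either the label or the ID.
stored_label = "RNA-seq of coding RNA from single cells"
stored_id = "TERM_A"

print(resolve_by_label(stored_label, old_release))  # label still resolves
print(resolve_by_label(stored_label, new_release))  # label no longer resolves
print(stored_id in new_release)                     # ID is unaffected by the relabel
```

Under these assumptions, only the label-based lookup is sensitive to the relabel, which is the situation described for the experiment type field.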

The full list of experiment type terms we are using in Annotare/ArrayExpress is here: https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html

We've had that discussion previously with the Single Cell Expression Atlas group and I now suspect that we never reached a resolution...

I'm afraid these are independent of each other, ArrayExpress curation and pipelines are not directly connected with the Single Cell Expression Atlas work.

ghost commented 2 years ago

... should probably be hyphenated by English grammar rules. But I think it goes against general EFO style that is using predominantly spaces. Hence, I would be suggesting to keep everything "single cell" space-separated for consistency.

@anjaf, thank you for the additional feedback. Is there a style guide or other reference for EFO that details the "general EFO style"? This type of reference would be helpful for editors. I did a quick check and do not see this particular issue addressed in OBO Foundry naming conventions. It is mentioned to use spaces to separate words, but one may argue that a compound word is one word.

@zoependlington, @dosumis, do you have any input on hyphenating compound words in labels vs separating the words with spaces? E.g., 'single-cell RNA sequencing' vs 'single cell RNA sequencing'

zoependlington commented 2 years ago

@bvarner-ebi We tend to follow OBO naming conventions as far as possible, so there isn't any specific EFO rule regarding this. Typically, I think we go for spaces, but there are a few terms in EFO with a hyphen (e.g. EFO_0030053). I think as long as there is a non-hyphenated version as a synonym then all should be well. Granted, @paolaroncaglia has worked more on the sequencing branch than I have so she may have a different opinion.
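
As a rough sketch of why a non-hyphenated synonym helps, the snippet below looks a name up against both the label and the exact synonyms of each term. The term ID `TERM_SC_RNA`, the dictionary layout, and the function `find_term` are assumptions made for illustration; they do not reflect how EFO or any EBI tool stores or searches terms.

```python
def find_term(name, terms):
    """Return the ID of the first term whose label or exact synonym matches `name`.

    Matching ignores letter case only; hyphens and spaces are compared as written.
    """
    wanted = name.casefold()
    for term_id, term in terms.items():
        names = [term["label"], *term["exact_synonyms"]]
        if any(candidate.casefold() == wanted for candidate in names):
            return term_id
    return None


# Hypothetical record: hyphenated label plus a space-separated exact synonym.
terms = {
    "TERM_SC_RNA": {
        "label": "single-cell RNA sequencing",
        "exact_synonyms": ["single cell RNA sequencing"],
    }
}

print(find_term("single-cell RNA sequencing", terms))  # matches the label
print(find_term("single cell RNA sequencing", terms))  # matches the synonym
print(find_term("single cell sequencing", terms))      # no match
```

With the synonym in place, a user searching with either spelling reaches the same term, so the choice of hyphen in the label matters less for lookup.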

paolaroncaglia commented 2 years ago

@anjaf (and @bvarner-ebi , see action item below please) hi, thank you for your feedback:

Do AE's tools and pipelines rely on ontology labels rather than IDs? That is not ideal.

Unfortunately, yes. We use the term labels in our metadata files without the ontology ID for the ArrayExpress experiment type. I agree this is less than ideal. But ArrayExpress is quite an old legacy system and we have no funding to make such changes at the moment. If the term labels change, I'm not sure if we can make this backwards compatible. We haven't had that situation yet. It would break a few things like experiment type searches with ontology expansion in ArrayExpress: https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RNA-seq+of+coding+RNA+from+single+cells%22

The full list of experiment type terms we are using in Annotare/ArrayExpress is here: https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html

Ok, so we can assume that, for any term not in that list, it is ok to change the label if needed. Note, we tend to only change labels if they're incorrect, or to ensure consistency when reasonable. @bvarner-ebi , as this ticket is now assigned to you, could you please

paolaroncaglia commented 2 years ago

@anjaf, also, are there other EFO terms, other than the ones in the experiment type list, that ArrayExpress refers to by labels rather than ID? Zoë and I kept a longer list of Annotare terms, but I don't think I can access that doc anymore - and it'd be good to have an updated/confirmed list please, for any potential EFO editor. Thank you.

paolaroncaglia commented 2 years ago

@zoependlington @bvarner-ebi

@bvarner-ebi We tend to follow OBO naming conventions as far as possible, so there isn't any specific EFO rule regarding this. Typically, I think we go for spaces, but there are a few terms in EFO with a hyphen (e.g. EFO_0030053). I think as long as there is a non-hyphenated version as a synonym then all should be well. Granted, @paolaroncaglia has worked more on the sequencing branch than I have so she may have a different opinion.

No objection there, thank you.

anjaf commented 2 years ago

Thanks for double-checking the "assay branch" for us! That is the only critical bit where we rely on the labels. The other terms are less critical and won't break any pipelines if the label changes. For those, Annotare uses the ontology IDs, rather than labels, so we'd only have an issue if terms get deprecated. But even then, it's fine as long as we're kept in the loop and can make the changes in the pipeline. I also remember that we had an "Annotare EFO check list" but I'm also not sure what happened to that. This file from Annotare's code contains the list of EFO accessions that are currently hard-coded on our end: https://github.com/arrayexpress/annotare2/blob/master/app/webapp/src/main/resources/Annotare-default.properties

I think for anything Expression Atlas curation related, label changes are fine too. These will get updated via the Zooma mappings.

ghost commented 2 years ago

[ ] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case? Thanks in advance.

and

This file from Annotare's code contains the list of EFO accessions that are currently hard-coded on our end: https://github.com/arrayexpress/annotare2/blob/master/app/webapp/src/main/resources/Annotare-default.properties

For clarity, should I be checking both lists noted above for changed labels?

paolaroncaglia commented 2 years ago

@bvarner-ebi

[ ] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case? Thanks in advance.

and

This file from Annotare's code contains the list of EFO accessions that are currently hard-coded on our end: https://github.com/arrayexpress/annotare2/blob/master/app/webapp/src/main/resources/Annotare-default.properties

For clarity, should I be checking both lists noted above for changed labels?

No, just the top one (https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html). Thanks for checking.

ghost commented 2 years ago

@paolaroncaglia, @anjaf,

[x] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case

I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?
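
The kind of check described above could be sketched as follows. This is not the procedure actually used on this ticket: the function `compare_labels` is hypothetical, and both dictionaries are assumed to have been filled in by hand (one from the Annotare list, one from the EFO release). Only the EFO_0005517 entry comes from this thread.

```python
def compare_labels(expected, current):
    """Classify each expected label against the current ontology label.

    Returns {term ID: status}, where status is one of
    'same', 'case_only', 'changed' or 'missing'.
    """
    report = {}
    for term_id, expected_label in expected.items():
        current_label = current.get(term_id)
        if current_label is None:
            report[term_id] = "missing"
        elif current_label == expected_label:
            report[term_id] = "same"
        elif current_label.casefold() == expected_label.casefold():
            report[term_id] = "case_only"
        else:
            report[term_id] = "changed"
    return report


# Label as it appears in the Annotare experiment type list.
annotare_labels = {"EFO_0005517": "RIP-chip by array"}
# Label as it appears in EFO.
efo_labels = {"EFO_0005517": "RIP-Chip by array"}

print(compare_labels(annotare_labels, efo_labels))
# → {'EFO_0005517': 'case_only'}
```

Separating `case_only` from `changed` makes it easier to ask the downstream team a specific question, such as whether their matching is case-sensitive.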

I also noted inconsistency with the use of hyphens vs spaces in the labels, e.g., 'non coding' vs 'high-throughput'. 'single nucleus RNA sequencing' has a space between single and nucleus.

Do we have consensus on whether or not to keep the label as 'single-cell RNA sequencing' (with the hyphen)? It does not appear on the Annotare list.

So, is the current plan as follows?

[ ] Keep 'single-cell RNA sequencing' as the label and add 'single cell RNA sequencing' as an exact synonym. Do not change hyphens to spaces or vice versa on any terms for now to avoid unintended downstream effects in ArrayExpress.

[ ] Set the following to be subclasses of 'single-cell RNA sequencing': 'full length single cell RNA sequencing' 'RNA-seq of coding RNA from single cells' 'RNA-seq of non coding RNA from single cells' 'tag based single cell RNA sequencing'

paolaroncaglia commented 2 years ago

@bvarner-ebi @anjaf

[x] Double-check that all labels in https://www.ebi.ac.uk/fg/annotare/help/experiment_types.html are unchanged in the latest version of EFO, just in case

I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?

@anjaf please?

I also noted inconsistency with the use of hyphens vs spaces in the labels, e.g., 'non coding' vs 'high-throughput'. 'single nucleus RNA sequencing' has a space between single and nucleus.

Yes, a typical issue when an ontology is edited by multiple pairs of hands and when the literature/common usage themselves do not use consistent wording. That's acceptable. I'd leave everything as is, it'd be hard to reach a wide consensus.

Do we have consensus on whether or not to keep the label as 'single-cell RNA sequencing' (with the hyphen)? It does not appear on the Annotare list.

In this case, I'd leave 'single-cell RNA sequencing' (it doesn't damage any pipeline and you don't have to un-do the edit).

So, is the current plan as follows? [ ] Keep 'single-cell RNA sequencing' as the label and add 'single cell RNA sequencing' as an exact synonym. Do not change hyphens to spaces or vice versa on any terms for now to avoid unintended downstream effects in ArrayExpress.

I agree

[ ] Set the following to be subclasses of 'single-cell RNA sequencing': 'full length single cell RNA sequencing' 'RNA-seq of coding RNA from single cells' 'RNA-seq of non coding RNA from single cells' 'single-cell RNA sequencing' 'tag based single cell RNA sequencing'

I agree. Anja confirmed that the above rearrangement wouldn't cause issues for them, as all terms would still be descendants of RNA assay. Thank you.
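
The "still descendants of RNA assay" condition can be checked mechanically. Below is a minimal sketch over a hand-written parent map; the function `is_descendant` is hypothetical, and the map assumes, for illustration only, that 'single-cell RNA sequencing' sits directly under 'RNA assay'. The real EFO hierarchy may have intermediate classes.

```python
def is_descendant(term, ancestor, parents):
    """Return True if `ancestor` is reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent == ancestor:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


# Proposed arrangement as {term: [direct parents]}.
# Assumption: 'single-cell RNA sequencing' is placed directly under 'RNA assay'.
parents = {
    "single-cell RNA sequencing": ["RNA assay"],
    "full length single cell RNA sequencing": ["single-cell RNA sequencing"],
    "RNA-seq of coding RNA from single cells": ["single-cell RNA sequencing"],
    "RNA-seq of non coding RNA from single cells": ["single-cell RNA sequencing"],
    "tag based single cell RNA sequencing": ["single-cell RNA sequencing"],
}

for term in parents:
    print(term, "->", is_descendant(term, "RNA assay", parents))
```

Every term in the map should report `True` under this arrangement; a `False` would flag a term that has dropped out of the 'RNA assay' branch.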

anjaf commented 2 years ago

I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?

Well spotted! We're indeed using the lower case version. Probably different casing doesn't cause issues in our pipeline. https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RIP-Chip+by+array%22+ Was this always the case? Then the difference didn't have any impact on our pipelines.

ghost commented 2 years ago

I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?

Well spotted! We're indeed using the lower case version. Probably different casing doesn't cause issues in our pipeline. https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RIP-Chip+by+array%22+ Was this always the case? Then the difference didn't have any impact on our pipelines.

It seems like it was written with 'C' at least since Oct 2021: https://github.com/EBISPOT/efo/issues/1296

I could not find the original term request.

ghost commented 2 years ago

@anjaf, @paolaroncaglia, thank you for all the feedback on this ticket. The changes made are detailed in PR #1483.

paolaroncaglia commented 2 years ago

@bvarner-ebi @anjaf

I spotted one difference: EFO_0005517, which is 'RIP-chip by array' in the list while the label is 'RIP-Chip by array' in efo. Is letter case an issue here: 'c' vs 'C'?

Well spotted! We're indeed using the lower case version. Probably different casing doesn't cause issues in our pipeline. https://www.ebi.ac.uk/arrayexpress/search.html?query=%22RIP-Chip+by+array%22+ Was this always the case? Then the difference didn't have any impact on our pipelines.

It seems like it was written with 'C' at least since Oct 2021: #1296

I could not find the original term request.

EFO trivia: the term was created by Drashtti, who left EBI in 2015. It's possible that the label has stayed that way since then.