EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Change EFO labels so they're not capitalised #274

Closed paolaroncaglia closed 2 years ago

paolaroncaglia commented 6 years ago

Stemming from https://github.com/EBISPOT/efo/issues/261

Not a must-have but a would-be-nice and something I think we already mentioned in the past: it would be good to make all EFO labels NOT capitalised (e.g. "sialidosis" instead of "Sialidosis"), other than for disease names based on their discoverer or similar of course. That way we would be more consistent internally, and would align better with MONDO (I recall that @cmungall pushed for that format in MONDO). GO also does that. We'd still have to deal with different capitalization in labels of terms imported from other ontologies.

Regardless of changing existing labels or not, we should aim for un-capitalised for all new EFO terms that we create ourselves.

Icebox?

zoependlington commented 6 years ago

I ran a SPARQL query to extract all of the labels that contain non-lowercase characters; the results can be found here: https://docs.google.com/spreadsheets/d/11GCzjO2_7V5WU5eEswqlc-VQ7Wx-e59NH0DmHK-4Hsc/edit?usp=sharing

We need to decide how to handle certain situations:

I will go through this list and highlight any that do not fall into these categories and therefore can have the labels edited to comply with the new lower-case ruling.
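
The selection step described above (labels that are not entirely lowercase) can also be approximated outside SPARQL. The following is only a minimal Python sketch, not the query that was actually run: the `labels` mapping and the `EX:` identifiers are hypothetical placeholders.

```python
def labels_with_uppercase(labels):
    """Return the subset of {term_id: label} whose label has an uppercase letter."""
    return {
        term_id: label
        for term_id, label in labels.items()
        if any(ch.isupper() for ch in label)
    }


# Hypothetical identifiers; the labels are examples mentioned in this thread.
example_labels = {
    "EX:1": "Sialidosis",
    "EX:2": "sialidosis",
    "EX:3": "type II diabetes",
}

flagged = labels_with_uppercase(example_labels)
```

A label such as "type II diabetes" is flagged too, which is why the extracted list still needs manual review: an uppercase letter is not necessarily an error.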

zoependlington commented 6 years ago

Editing guidelines have now been updated on Confluence.

cmungall commented 6 years ago

@nicolevasilevsky would be good to align this with MONDO editors guide.

I have some tools for normalizing capitalization, not quite ready for prime time..

nicolevasilevsky commented 6 years ago

I updated the MONDO editors guide, under labels: https://docs.google.com/document/d/19bp9MpCHCxbjMmbntB2e5gZNzzNlu06DnDB8xcoSXK8/edit#

paolaroncaglia commented 6 years ago

Thanks @zoependlington , @cmungall and @nicolevasilevsky ! Copying here MONDO's case rules for labels in full:

"Use lowercase, even for initial letter, except for these exceptions:
  • proper names (for example, Epstein-Barr virus-associated mesenchymal tumor)
  • latin names (for example, Homo sapiens)
  • acronyms (e.g. NADPH, IgG, GM14408, HIV-associated cancer)
  • roman numerals (e.g. type II diabetes)
  • Human gene symbols should be capitalized
  • Type symbols should be capitalized (e.g. “type A”)
Generally arabic > roman, except for established names (e.g. cranial nerves)"
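
The case rules quoted above could be roughly approximated in code. This is a hedged Python sketch, not a tool used by EFO or MONDO: the function names, the `proper_names` argument and the heuristics (an uppercase letter after the first character marks an acronym or symbol; an all-roman-letter token is a roman numeral) are assumptions made for illustration. Proper and Latin names cannot be detected automatically, so they have to be supplied by a curator.

```python
import re

# Assumption: a token made only of these capital letters is a roman numeral.
ROMAN_NUMERAL = re.compile(r"[IVXLCDM]+")


def keeps_case(word, previous, proper_names):
    """Decide whether a token is one of the exceptions and keeps its casing."""
    if word in proper_names:                      # curator-supplied names
        return True
    if any(ch.isupper() for ch in word[1:]):      # NADPH, IgG, HIV-associated
        return True
    if ROMAN_NUMERAL.fullmatch(word):             # II, IV, ...
        return True
    if previous.lower() == "type" and len(word) == 1 and word.isupper():
        return True                               # "type A"
    return False


def normalise_label(label, proper_names=frozenset()):
    """Lowercase every token of a label unless an exception applies."""
    words = label.split(" ")
    result = []
    for i, word in enumerate(words):
        previous = words[i - 1] if i > 0 else ""
        result.append(word if keeps_case(word, previous, proper_names) else word.lower())
    return " ".join(result)
```

For example, `normalise_label("Type II Diabetes")` gives `"type II diabetes"`, while `normalise_label("Homo sapiens", {"Homo"})` leaves the label unchanged. Heuristics like these would only produce suggestions; every proposed change would still need a manual check.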

paolaroncaglia commented 6 years ago

@zoependlington and I sorted the spreadsheet https://docs.google.com/spreadsheets/d/11GCzjO2_7V5WU5eEswqlc-VQ7Wx-e59NH0DmHK-4Hsc/edit?usp=sharing by alphabetical order of labels, as that will make it quicker to see related labels and speed things up. There are >13,000 classes to go through, so I'll start from the top and Zoe from the bottom. We added columns to simplify sorting the final output.

daniwelter commented 6 years ago

@paolaroncaglia @zoependlington I have some bandwidth at the moment if you want me to take a section as well

paolaroncaglia commented 6 years ago

@daniwelter sure, thanks! Please feel free to take any section you like; perhaps highlight the section you are working on with a light colour, so Zoe and I can see that at a glance and not step on your toes? :-)

zoependlington commented 6 years ago

@daniwelter I've marked between 4422 and 8844 so we've all got three equal(ish) sections, so feel free to take that section 😁

daniwelter commented 6 years ago

Oh, haha, I just marked letters L, M and N - ok, I'll undo my changes

paolaroncaglia commented 6 years ago

@daniwelter noted some useful edit rules:

cmungall commented 6 years ago
  • for ORDO terms that are hyphenated list of symptoms, could we please make sure that there are always spaces between the dashes to distinguish them from actually hyphenated words?

How are these generally written in the literature?

Note in some cases the hyphens are omitted altogether, making long tokens, which suggests a tighter binding between words?

paolaroncaglia commented 6 years ago

@cmungall Here are a couple of examples of the labels we edited by adding spaces around dashes:

Original: Acute infantile liver failure-multisystemic involvement syndrome
Edited: acute infantile liver failure - multisystemic involvement syndrome

The NIH GARD entry is called “Infantile liver failure syndrome 1” (https://rarediseases.info.nih.gov/diseases/13114/infantile-liver-failure-syndrome-1). A quick search in PubMed returns more entries for “acute infantile liver failure” than for “multisystemic involvement syndrome”. Both facts suggest that it’s fine to keep the second half of the label separated from the first via spaces and dash (not via a hyphen).

Original: ADNP-related multiple congenital anomalies-intellectual disability-autism spectrum disorder
Edited: ADNP-related multiple congenital anomalies - intellectual disability - autism spectrum disorder

Searching PubMed with either returns no exact matches (as expected); best matches returned suggest that it’s fine to keep sub-labels separate, as above. In this example it’s easy to spot the conceptual difference suggested by hyphens (ADNP-related) vs. space-dash-space (intellectual disability - autism spectrum disorder). I think that by editing in this way, we are making labels conceptually more correct, without removing text strings potentially useful for searching.

Any concern please let us know :-) Thanks!

cmungall commented 6 years ago

I see, in these cases I agree with the additional spacing.

I was thinking of cases like oto-spondylo-mega-epiphyseal dysplasia (also written as otospondylomegaepiphyseal dysplasia)

paolaroncaglia commented 6 years ago

@cmungall I've been leaving labels such as "oto-spondylo-mega-epiphyseal dysplasia" as they are (i.e. I'm not adding spaces, but I'm not removing dashes either, I couldn't even pronounce "otospondylomegaepiphyseal dysplasia"!) :-) Cheers.

daniwelter commented 6 years ago

Seconded - that was the reasoning behind my suggestion. Something that ends in syndrome or some other overarching term that pulls the previous element together can be spaced or hyphenated in the standard style. It was only names that are groupings of stand-alone symptoms or conditions that I felt should have spaces around the dashes to clearly delineate this use case.
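
The distinction discussed above (spaced dashes between stand-alone conditions, plain hyphens inside compound words) was applied by hand. Purely as an illustration, a crude heuristic could suggest candidates: treat a hyphen as a separator only when the text on both sides of it contains a space. This rule and the function name are assumptions, not something agreed in this thread, and it will miss cases such as a single-word symptom next to a hyphen.

```python
def space_symptom_dashes(label):
    """Put spaces around hyphens that join two multi-word segments."""
    segments = label.split("-")
    result = segments[0]
    for left, right in zip(segments, segments[1:]):
        if " " in left.strip() and " " in right.strip():
            # Both neighbours are multi-word: treat the hyphen as a separator.
            result = result.rstrip() + " - " + right.lstrip()
        else:
            # Otherwise keep the hyphen exactly as written.
            result += "-" + right
    return result
```

With this rule, the two labels quoted earlier in the thread get the spacing that was chosen manually, and "oto-spondylo-mega-epiphyseal dysplasia" is left untouched.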

paolaroncaglia commented 6 years ago

@zoependlington , @daniwelter and I have finished going through the spreadsheet of capitalized labels (https://docs.google.com/spreadsheets/d/11GCzjO2_7V5WU5eEswqlc-VQ7Wx-e59NH0DmHK-4Hsc/edit#gid=0). In summary:

13267 labels contained at least one uppercase letter when we started; of these,

  • 211 are obsolete terms, so no need to edit their labels;
  • 6320 labels need editing (all suggested new labels are in the spreadsheet);
  • 6736 labels don’t need editing.

paolaroncaglia commented 6 years ago

Leaving this to @zoependlington now, thanks!

zoependlington commented 6 years ago

Thanks @paolaroncaglia!

To Do:

cmungall commented 6 years ago

This spreadsheet is fantastic! Thanks so much for this.

I have a super-hacky script to apply these to obo format, but I think what we want is something at the OWLAPI level (ROBOT command?) or at the RDF level (SPARQL update?) that replaces literals.
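
One way the RDF-level option mentioned above might look is to generate SPARQL Update statements from an export of the spreadsheet. The sketch below is an assumption-heavy illustration, not the script referred to in this comment: the CSV column names (`iri`, `old_label`, `new_label`) and the `example.org` IRIs are hypothetical, and the inserted literal is written as a plain string, so any language tag or datatype on the original label would be dropped.

```python
import csv
import io

RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"


def quote_literal(text):
    """Escape a string for use as a double-quoted SPARQL literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def label_update(iri, old_label, new_label):
    """Build one DELETE/INSERT operation that swaps a term's rdfs:label."""
    subject = f"<{iri}>"
    return (
        f"DELETE {{ {subject} {RDFS_LABEL} ?old }}\n"
        f"INSERT {{ {subject} {RDFS_LABEL} {quote_literal(new_label)} }}\n"
        f"WHERE  {{ {subject} {RDFS_LABEL} ?old . "
        f"FILTER(STR(?old) = {quote_literal(old_label)}) }}"
    )


def updates_from_csv(handle):
    """Yield one update per row whose suggested label differs from the old one."""
    reader = csv.DictReader(handle)
    return [
        label_update(row["iri"], row["old_label"], row["new_label"])
        for row in reader
        if row["old_label"] != row["new_label"]
    ]


# Hypothetical two-row export; only the first row needs a change.
sheet = io.StringIO(
    "iri,old_label,new_label\n"
    "http://example.org/EX_1,Sialidosis,sialidosis\n"
    "http://example.org/EX_2,Homo sapiens,Homo sapiens\n"
)
updates = updates_from_csv(sheet)
script = " ;\n".join(updates)  # SPARQL Update operations are separated by ";"
```

Matching on the old label in the `WHERE` clause means a term whose label has changed since the spreadsheet was compiled is simply skipped rather than overwritten.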

paolaroncaglia commented 2 years ago

@zoependlington Following up on this ticket, I can no longer access the spreadsheet linked in https://github.com/EBISPOT/efo/issues/274#issuecomment-434608808, but:

1) I suspect that many of the uppercase labels come from external resources such as HP and Orphanet
2) Now that EFO imports HP dynamically, it doesn't make much sense to change the labels to lowercase for HP terms in EFO only
3) Orphanet terms in EFO will likely be gradually replaced by Mondo
4) The work we did previously was used in Mondo to fix capitalisation in ~350 terms already
5) So the only uppercase labels left worth lowercasing would be from EFO, if we still care about doing that. The spreadsheet would tell us how many there were at the time, but it's probably out of date.

Not a priority at all, but shall we update the spreadsheet, or close?

zoependlington commented 2 years ago

I believe this may be too out of date and maybe we can revisit. But for now, I would agree with closing the ticket.