biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Updated to OMIM Entry #497

Open fschiettecatte opened 2 years ago

fschiettecatte commented 2 years ago

Prefix

mim

Explanation

Hi

We (OMIM) wanted to make the following updates to our information:

We want the prefix to be ‘mim’, and the alternatives to be ‘MIM’, ‘omim’ and ‘OMIM’.

The patterns should be:

    ^[1-6]\d{5}$            # MIM number
    ^PS[1-6]\d{5}$          # PS number

Local unique identifiers are MIM numbers and PS numbers.

Patterns for the CURIEs would be:

    ^mim:[1-6]\d{5}$            # MIM number
    ^mim:PS[1-6]\d{5}$          # PS number
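
For illustration, a minimal Python sketch of how the patterns above could be checked. The regular expressions are taken from this request; the helper names (`classify_local_id`, `is_valid_curie`) are hypothetical and not part of any OMIM or Bioregistry API.

```python
import re
from typing import Optional

# Patterns copied from the request above.
MIM_NUMBER = re.compile(r"^[1-6]\d{5}$")   # MIM number
PS_NUMBER = re.compile(r"^PS[1-6]\d{5}$")  # PS number
# The two CURIE patterns merged into one expression (an editorial shortcut).
MIM_CURIE = re.compile(r"^mim:(?:PS)?[1-6]\d{5}$")


def classify_local_id(identifier: str) -> Optional[str]:
    """Return "mim" or "ps" for a valid local identifier, else None."""
    if MIM_NUMBER.fullmatch(identifier):
        return "mim"
    if PS_NUMBER.fullmatch(identifier):
        return "ps"
    return None


def is_valid_curie(curie: str) -> bool:
    """Check a CURIE against the proposed canonical 'mim:' form."""
    return MIM_CURIE.fullmatch(curie) is not None
```

Under these patterns, `603903` is a MIM number, `PS214100` is a PS number, and `omim:603903` is not a canonical CURIE.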

Please let me know if you need this in a different format, or want any additional information.

Thanks

Contributor ORCID

0000-0001-7422-9455

cthoyt commented 2 years ago

Hi @fschiettecatte, thanks for being patient while waiting for a reply. There's quite a bit to unpack here so I'll try and address it one point at a time.

We want the prefix to be ‘mim’, and the alternatives to be ‘MIM’, ‘omim’ and ‘OMIM’.

As a bit of background, one of the main purposes of the Bioregistry was to help reconcile the highly heterogeneous CURIEs in the wild on both the syntactic and semantic levels. In many cases, this meant reconciling differences between the way first-party documentation says a resource's identifiers should be referenced and the way they're actually referenced in the wild. In the case of OMIM, we found that most resources refer to its entries using either OMIM: or omim:, such as the Disease Ontology, Monarch Disease Ontology (MONDO), HGNC, Comparative Toxicogenomics Database (CTD), and many more. I don't recall ever seeing mim used before, so maybe you can give a bit of background and motivation on why you think this would be a good change, along with some examples of resources that are using CURIEs with mim?

Making this kind of change could also have some drawbacks:

  1. The Bioregistry would not be best reflecting the common usage of this resource (most important)
  2. Users looking for OMIM might not be able to find it anymore
  3. Resources standardized to Bioregistry would have to update (costly, requires extra effort from strained people)

The patterns should be:

    ^[1-6]\d{5}$          # MIM number
    ^PS[1-6]\d{5}$            # PS number

We've also been having a lot of discussions on what constitutes a semantic space. The best rules of thumb we have for when two semantic spaces should get different prefixes are:

  1. Different syntax for identifiers (e.g., as you described above, there are two different patterns)
  2. Different kinds of things represented by identifiers (e.g., disease/phenotype versus phenotypic series)
  3. Different URI for resolving identifier (e.g., https://omim.org/entry/603903 vs. https://omim.org/phenotypicSeries/PS214100)

Since OMIM and OMIM Phenotypic Series have all three of these distinctions, most resources like MONDO split these into two prefixes: omim and omim.ps and thus the Bioregistry has followed suit. Further, splitting these has the additional benefit that no custom string processing is needed to tell what kind of thing a given CURIE is, since the prefix gives the type information.
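
To make the "no custom string processing" point concrete, here is a small hedged sketch in Python. Both function names are made up for this example, and the `omim.ps` local identifier is assumed to be the bare number, as later comments in this thread suggest.

```python
from typing import Optional

# Split prefixes: the prefix alone tells you what kind of thing the CURIE names.
KIND_BY_PREFIX = {"omim": "entry", "omim.ps": "phenotypic series"}


def kind_from_split_curie(curie: str) -> Optional[str]:
    prefix, _, _local_id = curie.partition(":")
    return KIND_BY_PREFIX.get(prefix)


# Combined prefix: the local identifier has to be inspected as well.
def kind_from_combined_curie(curie: str) -> Optional[str]:
    prefix, _, local_id = curie.partition(":")
    if prefix != "mim":
        return None
    return "phenotypic series" if local_id.startswith("PS") else "entry"
```

With split prefixes the lookup is a plain dictionary access. With a combined prefix every consumer needs to know the `PS` convention.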

Can you comment on the benefits of combining these together that might outweigh the previous approaches the community has taken?


More generally, the Bioregistry review team @biopragmatics/bioregistry-reviewers (@callahantiff @lubianat @megbalk @bgyori) is still trying to grapple with how to respond to these kinds of requests where there's a potential conflict between resource owners and community usage. What do you think about this? I'd also like to hear from the Bioregistry team and any stakeholders in representing content from OMIM (e.g., OBO people).


P.S. the last time I looked carefully into OMIM I was frustrated that I could not download a full list of phenotypic series and their associated labels, constituent OMIM identifiers. Do you know if that's changed?

PPS fwiw there was a slightly more heated debate on the same topic on the OBO Foundry slack here and the response from @matentzn here but these conversations are now lost to history since slack has limited its free tier

matentzn commented 2 years ago

Across the Monarch (Mondo, HPO, etc.) and Biolink universes we use OMIM and OMIMPS as separate identifiers, but if we want to launch a concerted effort to fix these prefixes across the board, the first step would be to actually define proper PURLs for OMIM. A PURL is a persistent resolvable URL which refers to an OMIM term. The keyword here is "persistent" - i.e., this should work even if the OMIM team decides to move the URL to a different domain in the future (say, omim.ai).

Is a proper OMIM PURL something that the OMIM team would contemplate?

callahantiff commented 2 years ago

That's what I was going to note as well @matentzn -- that one strategy would be to do what the OBOs do.

fschiettecatte commented 2 years ago

MIM: has been in use since 2013 and possibly earlier by NCBI (Homo_sapiens.gene_info.gz & gene_info.gz) and other resources. The acronym MIM has been used to identify the numbers in the catalog since the late 1970s. I think it is advantageous to include both MIM: and OMIM: for MIM numbers in this specification. This would have the advantage of aligning with the published literature, where MIM numbers have appeared since the 1970s. When resources have asked what label to use for our numbers, we request that "MIM" be used. This has been our written policy since 2016.

All that being said we appreciate that people have been using MIM/mim/OMIM/omim interchangeably as a prefix and they should continue to do so if they wish, we just wanted to say that we would want "MIM:" to be the canonical prefix.

fschiettecatte commented 2 years ago

We think an OMIM PURL would be a great idea, though it is highly unlikely that OMIM will move to another domain (such as omim.ai in your example); it has been at omim.org since 2011. How would we proceed to create an OMIM PURL? And where would it be lookup-able?

Is a proper OMIM PURL something that the OMIM team would contemplate?

matentzn commented 2 years ago

A simple way to approximate a PURL scheme would be to say:

  1. You are pretty confident that the domain omim.org will exist even after humanity goes extinct (it is not really 100% clear what constitutes persistent here; I do not automatically see that purl.org is more "persistent" than omim.org). This makes the domain part stable.
  2. You use https URIs
  3. You have one prefix scheme like https://omim.org/vocabulary/ which you configure using an htaccess file to redirect to your website. For example, you can add HTTP rewrite rules that redirect https://omim.org/vocabulary/603903 to https://omim.org/entry/603903 on the website and https://omim.org/vocabulary/PS214100 to https://omim.org/phenotypicSeries/PS214100 on the website. When the website changes, https://omim.org/vocabulary/PS214100 and https://omim.org/vocabulary/603903 stay exactly the same - only the htaccess file will change and redirect the terms to the appropriate place on the website.

So basically you already have 1 & 2. 3 is a matter of an afternoon for one of your website developers to implement. The rest is marketing!
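
As a rough sketch of step 3, the redirect logic can be modelled as a small Python function. This is not an actual htaccess configuration; the `https://omim.org/vocabulary/` base is only the example prefix from the list above, and `redirect_target` is a hypothetical name.

```python
import re
from typing import Optional

# Example PURL base from the list above (an assumption, not a live endpoint).
PURL_BASE = "https://omim.org/vocabulary/"

# Ordered (pattern, target template) pairs standing in for rewrite rules.
REWRITE_RULES = [
    (re.compile(r"PS[1-6]\d{5}"), "https://omim.org/phenotypicSeries/{id}"),
    (re.compile(r"[1-6]\d{5}"), "https://omim.org/entry/{id}"),
]


def redirect_target(purl: str) -> Optional[str]:
    """Return the website URL a PURL would redirect to, or None if no rule matches."""
    if not purl.startswith(PURL_BASE):
        return None
    local_id = purl[len(PURL_BASE):]
    for pattern, template in REWRITE_RULES:
        if pattern.fullmatch(local_id):
            return template.format(id=local_id)
    return None
```

If the website layout changes, only the target templates in `REWRITE_RULES` need updating; the PURLs themselves stay fixed.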

matentzn commented 2 years ago

But before you go about this, I would ask @cmungall for advice so I don't lead you astray here..

allenbaron commented 2 years ago

I'd suggest using https://omim.org/entry/ for the pURLs because everyone is already using it and redirects would only be necessary for phenotypic series. I'm working under the assumption here that phenotypic series values (starting with 'PS') will not conflict with existing https://omim.org/entry/ pages.

Caveat: I'm not a web developer and have not set up pURLs previously.

matentzn commented 2 years ago

I agree @allenbaron I would also suggest using entry like you say, but only if that path is not tied to the website page system itself - i.e. if omim is using a Content Management System, and entry is a datatype, it is potentially subject to change (changing the name of the datatype). As long as it is understood that https://omim.org/entry/ is persistent no matter what, I agree, this redirect should be used.

fschiettecatte commented 2 years ago

We have had a few internal discussions about this and have come up with a scheme which we think will work.

We created an omim.org based PURL for MIM numbers and phenotypic series respectively:

https://omim.org/MIM:######

https://omim.org/MIM:PS######

Those will return a 302 redirection to the entry page and phenotypic series page respectively.

We added a 301 redirection to support the CURIE variations that have cropped up in the community, so the CURIE prefix could be ‘mim:’, ‘OMIM:’, ‘omim:’, ‘MIMPS:’, ‘mimps:’, ‘OMIMPS:’, or ‘omimps:’. All of these will return a 301 redirection to the canonical PURL above.

The ‘/entry’ and ‘/phenotypicSeries’ URLs won't change; a MIM entry has always been known as an ‘entry’.

We have also registered ‘/mim’ and ‘/omim’ at purl.org so https://purl.org/mim redirects to https://omim.org, and https://purl.org/mim/[path] redirects to https://omim.org/[path]. We don’t expect omim.org to go away in the foreseeable future, we registered the domain in 2001.

matentzn commented 2 years ago

@fschiettecatte Thank you for your work providing stable GUPRIs (globally unique, persistent, and resolvable identifiers) for OMIM content. I would like to offer a word of caution. I understand that from a website perspective, helping users find the right content is paramount, and more syntactic variations of the PURLs make this easier. However, from a data integration perspective, this is a nightmare. We need a single way to resolve OMIM PURLs, not 4 - and to ensure that everyone in the community who integrates OMIM content in knowledge graph applications and similar uses the exact same one.

I hereby offer my hopeful suggestion to remove all of these redirection alternatives apart from a single one, which you officially promote as the OMIM PURL.

fschiettecatte commented 2 years ago

@matentzn Thank you for those suggestions. I was trying to strike a balance between a hard spec and the variations the community at large has been using, by providing a redirection service for them.

I have removed all the redirections, so all that will be supported are the following:

https://omim.org/MIM:######

https://omim.org/MIM:PS######

And I have updated the documentation on our site accordingly.

matentzn commented 2 years ago

@fschiettecatte This is great! We will promote this now as the new official OMIM PURL! Thank you for working with us on this matter!

cthoyt commented 2 years ago

Many of my questions in https://github.com/biopragmatics/bioregistry/issues/497#issuecomment-1219689536 weren't addressed before discussion on this thread shifted towards PURLs, so there won't be any updates on that front yet, but based on the results from the PURL talk, there is now an update at #615 that reflects the new changes and keeps backwards compatibility (mostly).

fschiettecatte commented 2 years ago

The discussion did rather veer off into PURLs and I apologize for not addressing all your questions.

From what I can tell there are two remaining questions, one has to do with Phenotypic Series, and the other is your P.S. about downloading Phenotypic Series data.

Phenotypic Series numbers always have the prefix 'PS'; a plain MIM number (6 digits) refers to an Entry. The 'PS' prefix was required by NCBI when Phenotypic Series first appeared so that there was a clear differentiation between the two identifiers.

You can get a full list of all the Phenotypic Series at https://omim.org/phenotypicSeriesTitles/all, and each Phenotypic Series page (for example Noonan syndrome at https://omim.org/phenotypicSeries/PS163950) allows you to download the table in TSV/Excel format. If you have additional data needs, you can request access to Downloads (https://omim.org/downloads) and/or the API (https://omim.org/api).

Finally, OMIM will have a booth at ASHG 2022 and I will be there. If you are attending ASHG 2022, please feel free to drop by so we can meet in person.

cmungall commented 9 months ago

Just noting that GA4GH standards such as Phenopackets bake in use of OMIM rather than MIM. Phenopackets is an ISO standard and cannot be changed for 2 years. So usage of OMIM will likely continue for some time.

Of course, the providers of OMIM's wishes should be respected, but it's regrettable that we are in this situation. It would be useful if there was a mechanism whereby providers could provide official alternatives for prefixes. I know bioregistry allows for prefix synonyms, but these contain a lot of junk in many places; we should have a way of designating some as official.

sierra-moxon commented 8 months ago

Concurring with @cmungall - this would be a terrific feature to handle cases like: https://github.com/biopragmatics/bioregistry/issues/323 where the community historically uses one prefix, and the source would like another.

@cthoyt @callahantiff - what do you think?

matentzn commented 8 months ago

I read this issue not so much as a request by omim for everyone out there to change the prefixes of their OMIM data but as a request to document their organisational preferences. The point of having prefix synonyms is to be able to integrate different practices across a decentralised community; we know now that "mim" is the preferred prefix, but we have also documented that OMIM is a synonym, which means all the systems for standardising data (curies) can be used to integrate data from both sources.

The real technical problem we have is the fact that OMIMPS with this request is changing the ID format, from just a number to PS123. This will be, I can imagine, a significant technical burden in bioregistry to accommodate, e.g., curies.converter.standardize("OMIMPS:123") ---> "mim:PS123".
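
A minimal sketch of what that standardization would involve, written in plain Python rather than against the real `curies` or Bioregistry APIs. The synonym sets and the `standardize` function here are assumptions made for illustration only.

```python
from typing import Optional

# Assumed synonym sets (lowercased); the actual registry contents may differ.
ENTRY_PREFIXES = {"mim", "omim"}
SERIES_PREFIXES = {"mimps", "omimps", "omim.ps"}


def standardize(curie: str) -> Optional[str]:
    """Map a legacy OMIM-style CURIE onto the combined 'mim:' form."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return None
    normalized = prefix.lower()
    if normalized in ENTRY_PREFIXES:
        return f"mim:{local_id}"
    if normalized in SERIES_PREFIXES:
        # Unlike a plain prefix swap, the identifier itself must be rewritten.
        if not local_id.startswith("PS"):
            local_id = f"PS{local_id}"
        return f"mim:{local_id}"
    return None
```

The entry case is a simple prefix synonym lookup, while the phenotypic series case also has to rewrite the local identifier, which is the extra burden described above.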