biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

coordinate between bioregistry and prefixcommons/biocontext #19

Closed cmungall closed 2 years ago

cmungall commented 3 years ago

lol, @cthoyt and I keep independently developing similar things

https://pypi.org/project/prefixcommons/

which is the python lib for:

https://github.com/prefixcommons/biocontext

the idea is not to create another uber-registry, but to do ETL for existing registries (identifiers.org, OBO, GO) and allow people to remix and make their own prefix maps.
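To illustrate the "remix" idea, here is a minimal sketch of merging several prefix maps in priority order. It does not use the prefixcommons API; the function name `remix_prefix_maps` and the sample entries are assumptions made up for the example, not the actual contents of any registry.

```python
from typing import Dict, Iterable

PrefixMap = Dict[str, str]


def remix_prefix_maps(prefix_maps: Iterable[PrefixMap]) -> PrefixMap:
    """Merge prefix maps in priority order; earlier maps win on conflicts."""
    merged: PrefixMap = {}
    for prefix_map in prefix_maps:
        for prefix, uri_prefix in prefix_map.items():
            # setdefault keeps the first URI prefix seen for a given prefix
            merged.setdefault(prefix, uri_prefix)
    return merged


# Illustrative entries only (hypothetical extracts from two registries)
obo_context = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}
idorg_context = {
    "GO": "http://identifiers.org/go/",
    "UniProtKB": "http://identifiers.org/uniprot/",
}

# OBO is listed first, so its expansion of "GO" takes precedence
merged = remix_prefix_maps([obo_context, idorg_context])
```

Reordering the input list changes which registry wins, which is roughly what letting people make their own prefix maps would amount to.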

prefixcommons is used in linkml; see also https://biolink.github.io/biolink-model/#identifiers

I'm not sure there is a specific action here, and I don't know what the tradeoffs are between attempting to merge efforts vs. lightweight coordination, but I just wanted to flag this for now

cc @matentzn @hsolbrig @kshefchek @jmcmurry @kltm @deepakunni3

cthoyt commented 3 years ago

Prefix Commons is already queued up to get imported at https://github.com/bioregistry/bioregistry/issues/9. The rule-based semi-automated alignment keeps improving, so hopefully by the time I get to Prefix Commons (which is the biggest, besides Wikidata), the manual effort will be minimal.

The main goal of this project is to help normalize all of the different variants of each prefix. The secondary goal of this project is to assess the overlaps between all of these resources (short story - big). The tertiary goal of this project is to suck all of the domain knowledge out of all of the biocurators who have contributed to previous resources and make a useful resource so new users can more easily navigate through bioontologies.
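As a rough sketch of what normalizing prefix variants could look like, the snippet below maps different spellings to one canonical prefix. This is a standalone illustration, not the Bioregistry's implementation; the `SYNONYMS` table and the names `squash`, `build_index`, and `normalize_variant` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Hypothetical canonical prefixes and some variants seen in the wild
SYNONYMS: Dict[str, List[str]] = {
    "go": ["GO", "gene_ontology"],
    "chebi": ["CHEBI", "ChEBI"],
    "uniprot": ["UniProtKB", "UniProt"],
}


def squash(prefix: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", prefix.lower())


def build_index(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every squashed variant (and the canonical form) to the canonical prefix."""
    index: Dict[str, str] = {}
    for canonical, variants in synonyms.items():
        for variant in [canonical, *variants]:
            index[squash(variant)] = canonical
    return index


def normalize_variant(prefix: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical prefix, or None if the variant is unknown."""
    return index.get(squash(prefix))


INDEX = build_index(SYNONYMS)
```

With this table, `normalize_variant("Gene_Ontology", INDEX)` resolves to `go`, while an unknown prefix returns `None` rather than a guess.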

Please see https://bioregistry.io/summary for a bit more explanation of what's going on here. The image below (automatically updated nightly) keeps track of the coverage of the Bioregistry vs. other resources.

matentzn commented 3 years ago

Cool analysis!

cthoyt commented 3 years ago

@matentzn where is the underlying data for https://prefixcommons.org/? I grabbed data from here but it's not complete. I couldn't find the full database for download anywhere in the prefixcommons GitHub org.

matentzn commented 3 years ago

Uuu, not sure. @cmungall would know better

cthoyt commented 3 years ago

Looks like it's coming from https://docs.google.com/spreadsheet/pub?key=0AmzqhEUDpIPvdFR0UFhDUTZJdnNYdnJwdHdvNVlJR1E&single=true&gid=0&output=csv and getting ingested and converted to a few formats by https://github.com/prefixcommons/data-ingest/blob/b78f5305eac1a3077be81ff38a993d51f87a83a2/code/LSR2json.php
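For reference, the CSV-to-JSON step could be sketched in Python roughly as follows. This is not a port of `LSR2json.php`; the column names `prefix` and `title`, the sample rows, and the function `csv_to_records` are assumptions for illustration, since the real spreadsheet's columns aren't shown here.

```python
import csv
import io
import json
from typing import Dict

# Hypothetical sample standing in for the published spreadsheet export
SAMPLE_CSV = "prefix,title\ngo,Gene Ontology\nchebi,\n"


def csv_to_records(text: str) -> Dict[str, Dict[str, str]]:
    """Index CSV rows by their (assumed) 'prefix' column, dropping empty cells."""
    records: Dict[str, Dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        prefix = row.get("prefix")
        if not prefix:
            continue  # skip rows without a prefix
        records[prefix] = {
            key: value for key, value in row.items() if key != "prefix" and value
        }
    return records


records = csv_to_records(SAMPLE_CSV)
as_json = json.dumps(records, indent=2, sort_keys=True)
```

The same `records` dictionary could then be serialized to other formats as needed.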

cthoyt commented 2 years ago

With the last year of experience working on this, I'm pretty confident in saying that the Bioregistry shouldn't "coordinate" with other efforts.

It's able to pull from other resources on a nightly basis, automatically align many prefixes, and queue the rest for manual curation. It is also a completely open place where anyone can PR in updates, both novel information and overrides for things that are wrong in other resources.
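The align-then-queue workflow can be pictured with a small sketch like the one below. It assumes a single, very simple matching rule (case- and punctuation-insensitive equality); the Bioregistry's actual alignment rules aren't described in this thread, and the names `align_prefixes` and `squash` are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple


def squash(prefix: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", prefix.lower())


def align_prefixes(
    external: Iterable[str], registry: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split external prefixes into automatic matches and a manual curation queue."""
    lookup = {squash(prefix): prefix for prefix in registry}
    aligned: Dict[str, str] = {}
    curation_queue: List[str] = []
    for prefix in external:
        match = lookup.get(squash(prefix))
        if match is None:
            curation_queue.append(prefix)  # no rule fired; needs a human
        else:
            aligned[prefix] = match
    return aligned, curation_queue


# Made-up prefixes for illustration
aligned, curation_queue = align_prefixes(
    external=["GO", "Uni-Prot", "newdb"],
    registry=["go", "uniprot", "chebi"],
)
```

Here `GO` and `Uni-Prot` align automatically, and `newdb` lands in the curation queue.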

All of the Bioregistry is under an open license, so anyone is invited (and encouraged) to build their own pipelines to uptake parts or all of the Bioregistry into their own resource.

Since there can be technical solutions for both directions of sync, I don't think it makes sense to get caught up in the difficult task of community organization for this purpose, especially keeping in mind that previous efforts to do so weren't super successful either.

However, it still makes sense for the community to engage in different parts of the Bioregistry, such as its (developing) prefix policy (https://github.com/biopragmatics/bioregistry/issues/158) and associated unit tests that enforce each part of the prefix policy on a technical level.