Knowledge-Graph-Hub / universalizer

The KG-Hub Universalizer provides functions for knowledge graph cleanup and identifier normalization.
BSD 3-Clause "New" or "Revised" License

Duplication of functionality of Bioregistry #9

Open cthoyt opened 1 year ago

cthoyt commented 1 year ago

There's a huge amount of prior effort in normalizing identifiers built into the Bioregistry, including support for "preferred prefixes", which include opinionated stylization. Can you help me better understand what functionality was missing from the Bioregistry that motivated you to write very similar code?

caufieldjh commented 1 year ago

Hi @cthoyt! This package imports curies and is, in fact, fully reliant on Bioregistry prefixes and prefixmaps for IRI->prefix conversions and vice versa.

Part of the motivation of this project is to unify methods we currently have pieces of in other packages, like kgx and individual KG assembly projects like KG-Phenio, KG-OBO, and KG-Bioportal. In the case of those latter two projects, the challenge is not only in mapping IRIs to prefixes but in ensuring that all prefixes are internally consistent across the project and that we handle as many variants as possible (these both deal with transforming input ontologies which may use a variety of different forms of the "same" IRIs).

There are several other features that make universalizer distinct:

It's also entirely possible that there's still some code overlap with curies - please let me know if you see specific areas of redundancy. I'm quite reliant on the Converter so thank you for assembling curies and keeping the Bioregistry going!

cthoyt commented 1 year ago

> Part of the motivation of this project is to unify methods we currently have pieces of in other packages, like kgx and individual KG assembly projects like KG-Phenio, KG-OBO, and KG-Bioportal. In the case of those latter two projects, the challenge is not only in mapping IRIs to prefixes but in ensuring that all prefixes are internally consistent across the project and that we handle as many variants as possible (these both deal with transforming input ontologies which may use a variety of different forms of the "same" IRIs).

A lot of this functionality already exists in the Bioregistry in a fully generalizable way: identifiers are first standardized to the Bioregistry's internal standard, then mapped out to the flavor of prefixes desired for a given use case.
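To make the two-step idea concrete, here is a minimal, stdlib-only sketch. It is not the Bioregistry's API: the lookup tables, the function names `standardize_curie` and `restyle_curie`, and the example synonyms are all hypothetical and only illustrate "standardize first, then restyle for the consumer".

```python
from typing import Optional

# Hypothetical tables for illustration only; not real Bioregistry data.
# Step 1 input: every known spelling of a prefix maps to one internal standard.
SYNONYM_TO_STANDARD = {
    "go": "go",
    "GO": "go",
    "chebi": "chebi",
    "CHEBI": "chebi",
    "ChEBI": "chebi",
}

# Step 2 input: a consumer-specific choice of output stylization.
PREFERRED_STYLE = {"go": "GO", "chebi": "CHEBI"}


def standardize_curie(curie: str) -> Optional[str]:
    """Rewrite a CURIE to use the internal standard prefix, or return None."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return None  # not a CURIE at all
    standard = SYNONYM_TO_STANDARD.get(prefix)
    if standard is None:
        return None  # unknown prefix
    return f"{standard}:{local_id}"


def restyle_curie(curie: str, style: dict) -> str:
    """Map a standardized CURIE out to the prefix flavor a project wants."""
    prefix, _, local_id = curie.partition(":")
    return f"{style.get(prefix, prefix)}:{local_id}"
```

Keeping the two steps separate means each project only has to supply its own style table, while the synonym handling is shared.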

> It loosens some of the assumptions about bijective maps from IRI to prefix in order to accommodate the variants mentioned above for projects like KG-OBO and KG-Bioportal

I think you're referring to this:

https://github.com/Knowledge-Graph-Hub/universalizer/blob/0cf527c2c7f740ea2bb934adbf061efd959ba8d8/universalizer/norm.py#L200-L209

In fact, we already implemented a more general data structure in curies to handle this situation.
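As a rough illustration of what a non-bijective structure can look like, the sketch below lets one prefix carry several IRI prefix variants and compresses an IRI by longest match. It is a stdlib-only approximation written for this thread, not the curies implementation: `PrefixRecord`, `build_reverse_index`, and `compress` are hypothetical names, and the synonym IRI prefix is an illustrative example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PrefixRecord:
    """One canonical prefix plus any number of IRI prefix variants (hypothetical)."""

    prefix: str
    uri_prefix: str
    uri_prefix_synonyms: List[str] = field(default_factory=list)


def build_reverse_index(records: List[PrefixRecord]) -> Dict[str, str]:
    """Map every IRI prefix variant to its canonical prefix (many-to-one)."""
    index: Dict[str, str] = {}
    for record in records:
        for uri_prefix in [record.uri_prefix, *record.uri_prefix_synonyms]:
            index[uri_prefix] = record.prefix
    return index


def compress(iri: str, index: Dict[str, str]) -> Optional[str]:
    """Turn an IRI into a CURIE using the longest matching IRI prefix."""
    matches = [uri_prefix for uri_prefix in index if iri.startswith(uri_prefix)]
    if not matches:
        return None
    best = max(matches, key=len)
    return f"{index[best]}:{iri[len(best):]}"


# Illustrative record: the synonym below is an assumed variant form.
RECORDS = [
    PrefixRecord(
        prefix="CHEBI",
        uri_prefix="http://purl.obolibrary.org/obo/CHEBI_",
        uri_prefix_synonyms=["http://identifiers.org/chebi/"],
    )
]
INDEX = build_reverse_index(RECORDS)
```

With this layout, adding another variant of the "same" IRI is a data change (one more synonym on the record) rather than a code change.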

We should plan time for a call next week so I can walk you through some of this functionality. That way we can join forces and develop really high-quality, reusable software together, rather than doing it in isolation.