Hard coded CURIES in OAK code cause confusion when ontologies use different prefix maps

matentzn commented 10 months ago

Having pieces of code like https://github.com/INCATools/ontology-access-kit/blob/d139e99fe7faa109e0b71840e20140852a8267d9/src/oaklib/utilities/lexical/lexical_indexer.py#L52 seems dangerous to me, and from a quick search I think there are a number of places in OAK where this occurs. @joeflack4 just uncovered a case where we passed an oboInOwl prefix to semsql, which resulted in lexmatch no longer being able to understand that oboInOwl:hasExactSynonym (which was used in the ontology) is, in fact, the same as oio:hasExactSynonym. There are various ways to solve this problem:

1. No code exists where CURIEs are defined that can't be overwritten by the user. In the ex
2. A standardised "OAK context" exists to which all incoming information is standardised before processing, or, the other way around, OAK entities in the code are standardised (using curies.Converter.standardize()) against an incoming prefix map.
3. We could require that incoming semsql ontologies must be standardised against the OAK context (prefix map).
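
To make the first half of option 2 concrete, here is a minimal, library-free sketch of standardising an incoming CURIE against a preferred prefix. The `PREFIX_SYNONYMS` table and the `standardize_curie` helper are hypothetical names used only for illustration; they are not OAK or `curies` API. The point is simply that once prefix synonyms are mapped to one preferred prefix, the two spellings from the lexmatch case compare equal.

```python
# Hypothetical synonym table: alternative prefix -> preferred ("OAK context") prefix.
PREFIX_SYNONYMS = {
    "oboInOwl": "oio",
}


def standardize_curie(curie: str) -> str:
    """Rewrite a CURIE so that it uses the preferred prefix, if one is known."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        # No colon, so this is not a CURIE; return it unchanged.
        return curie
    preferred = PREFIX_SYNONYMS.get(prefix, prefix)
    return f"{preferred}:{local_id}"


# The CURIE found in the ontology and the one hard-coded in the code now agree.
in_ontology = standardize_curie("oboInOwl:hasExactSynonym")
in_code = standardize_curie("oio:hasExactSynonym")
assert in_ontology == in_code
```

A real implementation would presumably build the synonym table from the prefix map itself rather than hard-coding it.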

None of this is particularly easy - (3) is probably easiest, but we would have to give some tool support, like

runoak normalise-prefixes -i ont.db.

cmungall commented 10 months ago

3 - this is the way

joeflack4 commented 10 months ago

What does (3) entail?

Currently, you can pass a prefix map (currently only a non-EPM bimap is supported; prefixes.csv) when creating a SemSQL DB. Are we saying that this prefix map can have additional entries not already in the OAK context so long as there is no conflict (i.e. a URI prefix which is assigned to a different prefix than the one OAK has assigned)?

Couldn't we just interpret such conflicts as prefix synonyms and maybe throw a warning to the user?
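
A rough sketch of what "interpret conflicts as prefix synonyms and warn" could look like. This is an illustration under stated assumptions, not OAK behaviour: `merge_prefix_maps` is a hypothetical helper, both maps are plain `prefix -> URI prefix` dictionaries, and only the conflict described above (one URI prefix bound to two different prefixes) is handled.

```python
import warnings

OIO = "http://www.geneontology.org/formats/oboInOwl#"


def merge_prefix_maps(context: dict, incoming: dict) -> tuple:
    """Merge `incoming` into `context`, recording conflicting prefixes as synonyms.

    Returns (merged, synonyms), where `synonyms` maps an incoming prefix to the
    context prefix that is already bound to the same URI prefix.
    """
    merged = dict(context)
    uri_to_prefix = {uri: prefix for prefix, uri in context.items()}
    synonyms = {}
    for prefix, uri in incoming.items():
        existing = uri_to_prefix.get(uri)
        if existing is not None and existing != prefix:
            # Same URI prefix, different prefix: treat as a synonym and warn.
            warnings.warn(f"prefix {prefix!r} treated as a synonym of {existing!r}")
            synonyms[prefix] = existing
        elif existing is None and prefix not in merged:
            # Entirely new entry: add it to the merged context.
            merged[prefix] = uri
            uri_to_prefix[uri] = prefix
        # Anything else keeps the context's existing binding untouched.
    return merged, synonyms


merged, synonyms = merge_prefix_maps(
    {"oio": OIO},
    {"oboInOwl": OIO, "EX": "http://example.org/"},
)
assert synonyms == {"oboInOwl": "oio"}
assert merged == {"oio": OIO, "EX": "http://example.org/"}
```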

matentzn commented 10 months ago

prefixes.csv is actually a "one-way EPM in disguise": the same prefix can be mapped to multiple URL prefixes. See my comments in #699 for what I think the best solution would be. The key issue here is not the EPM - it is that prefix assumptions are hardcoded in the code. All entities in the code should be cycled through a standard EPM before being used (say, curies.Converter.standardise("oio:hasDbXref") or something similar). Ideally, --epm can always be passed in to all oak commands to replace the default EPM, which re-serialises the built-in CURIEs prior to usage.
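
As an illustration of "re-serialising the built-in CURIEs prior to usage", the following sketch expands a hard-coded CURIE with a default map and compresses it again with a user-supplied map. `DEFAULT_MAP`, `HAS_DBXREF` and `reserialize` are hypothetical names, plain dictionaries stand in for a full EPM, and this is not how OAK or the `curies` package actually implements it.

```python
OIO = "http://www.geneontology.org/formats/oboInOwl#"

# Hypothetical built-in defaults: the prefix assumptions baked into the code.
DEFAULT_MAP = {"oio": OIO}
HAS_DBXREF = "oio:hasDbXref"


def reserialize(curie: str, default_map: dict, user_map: dict) -> str:
    """Expand `curie` with `default_map`, then compress it with `user_map`."""
    prefix, _, local_id = curie.partition(":")
    uri_prefix = default_map.get(prefix)
    if uri_prefix is None:
        # Unknown built-in prefix: nothing to re-serialise.
        return curie
    uri = uri_prefix + local_id
    # Try longer URI prefixes first so the most specific match wins.
    for user_prefix, user_uri in sorted(
        user_map.items(), key=lambda kv: len(kv[1]), reverse=True
    ):
        if uri.startswith(user_uri):
            return f"{user_prefix}:{uri[len(user_uri):]}"
    # The user map does not cover this URI: keep the built-in spelling.
    return curie


# With a user map that prefers "oboInOwl", the built-in constant follows suit.
user_map = {"oboInOwl": OIO}
assert reserialize(HAS_DBXREF, DEFAULT_MAP, user_map) == "oboInOwl:hasDbXref"
```

With this approach the constants in the code stay readable, while comparisons at runtime use whatever prefixes the loaded ontology uses.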

Hard coded CURIES in OAK code cause confusion when ontologies use different prefix maps #698