INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0

runoak mappings does not add valid curie_map #462

Closed matentzn closed 1 year ago

matentzn commented 1 year ago
runoak -i sqlite:obo:uberon mappings -O sssom > tmp/anatomy-mapping-uberon.tsv

Results in a lot of mappings, but no curie_map.

cmungall commented 1 year ago

Spoke to @hrshdhgd about how to do this: StreamingSssomWriter can use self.ontology_interface.prefix_map to get the map.

But remember that many ontologies are not well-behaved and do not contain prefix expansions for all xrefs. Here the user can choose to pass a prefix map in as a global option

e.g.

runoak --named-prefix-map bioregistry -i sqlite:obo:uberon mappings -O sssom > tmp/anatomy-mapping-uberon.tsv

however we still don't expect this to work for many prefixes

✗ runoak --named-prefix-map bioregistry prefixes FMA MBA BIRNLEX SCTID Wikipedia neuronames
key val
FMA http://purl.obolibrary.org/obo/FMA_
SCTID   http://snomed.info/id/

I think the default should still be to emit a mapping and have an optional strict mode

hrshdhgd commented 1 year ago

As of now clean_prefix_map() is strict and errors out if prefixes used in the dataframe do not exist in the curie_map. Should I add a flag strict: bool [default=True] for clean_prefix_map() in sssom-py?

Created a PR: https://github.com/mapping-commons/sssom-py/pull/353
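
A minimal sketch of what a strict versus non-strict prefix check could look like. This is not the sssom-py implementation; the function names missing_prefixes and check_prefixes are hypothetical and only illustrate the behaviour discussed here: in strict mode an undeclared prefix raises an error, otherwise the undeclared prefixes are returned so the caller can decide what to do with them.

```python
from typing import Dict, Iterable, Set


def missing_prefixes(curies: Iterable[str], curie_map: Dict[str, str]) -> Set[str]:
    """Return the prefixes used in `curies` that have no entry in `curie_map`."""
    # Everything before the first colon is treated as the CURIE prefix.
    used = {curie.split(":", 1)[0] for curie in curies if ":" in curie}
    return used - set(curie_map)


def check_prefixes(
    curies: Iterable[str], curie_map: Dict[str, str], strict: bool = True
) -> Set[str]:
    """Hypothetical helper: fail in strict mode, report in non-strict mode."""
    missing = missing_prefixes(curies, curie_map)
    if missing and strict:
        raise ValueError(f"Prefixes missing from curie_map: {sorted(missing)}")
    return missing
```

With strict=False the caller gets the set of undeclared prefixes back and can still emit the mappings, which is the behaviour requested above for messy xrefs.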

cmungall commented 1 year ago

check with @matentzn but I think we need to allow this - we need to be able to extract messy unregistered xrefs from ontologies into sssom to analyze them in order to make progress on registering them. alternatively you can just inject into some kind of global unknown namespace

matentzn commented 1 year ago

The general thought "allow messy prefixes to be parsed" I guess is practical, but I am not sure this is a SSSOM problem - shouldn't the parser that registers the prefix guess, or randomly generate, a prefix map entry?

You could:

  1. Take the prefixes from self.ontology_interface.prefix_map where they exist
  2. Where they don't, check the sssom default context, or bioregistry directly
  3. Where neither works, just generate a randomised prefix entry BIRNLEX: http://w3id.org/oak/unknown_prefixes/birnlex

The ontology parsing process basically builds up a prefix map which can then be used to finalise the SSSOM data frame.

I do not think we should permit missing prefix declarations in the curie_map - that is too much wild west and will just result in underspecified mapping files. I hope this makes sense, if not, let's clarify!

hrshdhgd commented 1 year ago

You could:

  1. Take the prefixes from self.ontology_interface.prefix_map where they exist
  2. Where they don't, check the sssom default context, or bioregistry directly
  3. Where neither works, just generate a randomised prefix entry BIRNLEX: http://w3id.org/oak/unknown_prefixes/birnlex

Do we do this in oaklib or sssom-py ?

matentzn commented 1 year ago

I would say oak. We already do something like that in SSSOM but I think the burden should be on the parser to try to interpret the ontology

cmungall commented 1 year ago

On Wed, Mar 15, 2023 at 3:34 AM Nico Matentzoglu @.***> wrote:

> The general thought "allow messy prefixes to be parsed" I guess is practical, but I am not sure this is a SSSOM problem - shouldn't the parser that registers the prefix guess, or randomly generate, a prefix map entry?
>
> You could:
>
>   1. Take the prefixes from self.ontology_interface.prefix_map where they exist

Oh of course we should always do this

>   2. Where they don't, check the sssom default context, or bioregistry directly

I’d prefer this more explicitly under user control and to only use the semantic subset in prefix maps. Generating an unstable prefix map with cgi bin .pl expansions is worse than 3 below

>   3. Where neither works, just generate a randomised prefix entry BIRNLEX: http://w3id.org/oak/unknown_prefixes/birnlex

I like this

> The ontology parsing process basically builds up a prefix map which can then be used to finalise the SSSOM data frame.
>
> I do not think we should permit missing prefix declarations in the curie_map - that is too much wild west and will just result in underspecified mapping files. I hope this makes sense, if not, let's clarify!


matentzn commented 1 year ago

Ok, so:

  1. user supplied
  2. self.ontology_interface.prefix_map
  3. unknown prefix URI

It's not what we do in SSSOM toolkit for now, but I am fine with this option!
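
The precedence order above can be sketched in plain Python. This is an illustration only, not code from oaklib or sssom-py: build_curie_map and its parameters are hypothetical names, and the placeholder base URI is the one suggested earlier in this thread.

```python
from typing import Dict, Iterable, Optional

# Placeholder namespace proposed in this thread for unregistered prefixes.
UNKNOWN_PREFIX_BASE = "http://w3id.org/oak/unknown_prefixes/"


def build_curie_map(
    used_prefixes: Iterable[str],
    user_map: Optional[Dict[str, str]] = None,
    ontology_map: Optional[Dict[str, str]] = None,
    unknown_base: str = UNKNOWN_PREFIX_BASE,
) -> Dict[str, str]:
    """Hypothetical helper: resolve every used prefix to an expansion."""
    user_map = user_map or {}
    ontology_map = ontology_map or {}
    curie_map: Dict[str, str] = {}
    for prefix in sorted(set(used_prefixes)):
        if prefix in user_map:
            # 1. user supplied
            curie_map[prefix] = user_map[prefix]
        elif prefix in ontology_map:
            # 2. prefix map declared by the ontology
            curie_map[prefix] = ontology_map[prefix]
        else:
            # 3. unknown prefix URI, lowercased as in the BIRNLEX example
            curie_map[prefix] = f"{unknown_base}{prefix.lower()}"
    return curie_map
```

Because every used prefix ends up with some expansion, the resulting curie_map has no missing declarations, and the placeholder namespace makes the unregistered prefixes easy to find afterwards.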

hrshdhgd commented 1 year ago

So the strict flag in clean_prefix_map() is good or no?

In terms of the bullets above, if I understand correctly:

  1. I see a --prefix option for the oaklib CLI which I assume is where the user passes the prefixes. It is not used in the mappings function at the moment. I'll need to implement this, correct?
    • If so, I'll have to pass this to writer.emit() which makes the code messy. Am I missing something here?
  2. This is already implemented as default at the current state of the code.
  3. So if there are prefixes in the dataframe that are absent from the curie_map after clean_prefix_map() in sssom, I implement the unknown prefix URI in oaklib? In my opinion that spreads the logic too untidily across the two libraries.

matentzn commented 1 year ago

You are right @hrshdhgd, I didn't think this through.

What do you think about adding an else clause here:

https://github.com/mapping-commons/sssom-py/pull/353/files#diff-8e0eb781f6bf2cf74b9c6a904555d8bd7f214fef45ea8e55da8527f834e600e4R177

That takes the unknown prefixes and adds them to the mapping set, using http://w3id.org/sssom/unknown_prefix/birnlex/

as a value? Just thinking out loud here

hrshdhgd commented 1 year ago

The source of my confusion is where we need to add this functionality: oaklib or sssom-py?

matentzn commented 1 year ago

Don't worry, you are not the only confused person - I don't know yet, that's why I am bouncing around.

If we go with my suggestion above, we could do it in sssom-py. Basically, in non-strict mode, we just assume we should generate "unknown prefix" IRIs. I think this makes sense, also for the obographs parser in sssom toolkit.

But it may lead to quite a few changes to sssom-py, especially in the parse CLI (and perhaps convert). One step at a time.

Would the https://github.com/mapping-commons/sssom-py/pull/353 PR solve the immediate problem?

hrshdhgd commented 1 year ago

For now yes, it solves the immediate problem. My only question there is should the default be True or False for the strict flag.

matentzn commented 1 year ago

I think it should be true by default.