INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0
123 stars 29 forks source link

`get_label*` issues #212

Open joeflack4 opened 2 years ago

joeflack4 commented 2 years ago

Summary

I have a list of terms and I'm trying to get their labels using, but I got errors in some cases, and no label in others.

1-3: Using SqlImplementation

1. No get_label_by_uri() or get_labels_by_uris()

I appreciate the CURIE methods, and will likely use those more often; I'd imagine these URI ones are somewhere on the to-do list.

2. get_label_by_curie() errors out

Python code:

        labels = []
        for curie in list(df_i['term_id']):
            label = si.get_label_by_curie(curie)
            labels.append(label)

Short err: sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) file is not a database Long err:

/Users/joeflack4/virtualenvs/mondo-ingest/bin/python /Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 49437 --file /Users/joeflack4/projects/mondo-ingest/src/analysis/excluded_terms_in_mondo.py -o reports/excluded_terms_in_mondo_xrefs.tsv
Connected to pydev debugger (build 211.7628.24)
Traceback (most recent call last):
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlite3.DatabaseError: file is not a database

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) file is not a database
[SQL: SELECT value FROM rdfs_label_statement WHERE subject = ?]
[parameters: ('ICD10CM:H18.1',)]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

3. get_labels_for_curies() errors out & prints out a lot to stderr

Code: labels = [x for x in si.get_labels_for_curies(list(df_i['term_id']))]

Error:

 ...'UMLS:C0024588', 'SCTID:720344007', 'OMIM:254500', 'OMIM:156530', 'MESH:D010300', 'ICD10WHO:H53.31')]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

It showed a lot more than this; more than my IDE's terminal could show. I couldn't see the actual stacktrace. Also looked like it printed the full list of terms in the ontology.

~### 4. UX: get_labels_for_curies() returns a generator~ ~I think the documentation shows specifically how to use it, but I wonder if this is the best design. I may just not be sure about the various benefits, but I'd definitely prefer if this defaulted to a (a) flat list, or (b) dict of CURIE -> label.~

Edit: This is a design choice.

Additional information

How I'm initializing:

from oaklib.resource import OntologyResource
from oaklib.implementations.sqldb.sql_implementation import SqlImplementation

resource = OntologyResource(slug='myOwlOntology.owl', local=True)
si = SqlImplementation(resource)

Relevant dependency versions:

oaklib==0.1.34
sssom==0.3.11
sssom-schema==0.9.4

All dependency versions: mondo-ingest-pip-freeze.txt

4. Using ProntoImplementation got no labels() when passing CURIE

Example class

    <owl:Class rdf:about="http://www.orpha.net/ORDO/Orphanet_10">
        <rdfs:subClassOf rdf:resource="http://www.orpha.net/ORDO/Orphanet_377789"/>
        <rdfs:subClassOf rdf:resource="http://www.orpha.net/ORDO/Orphanet_557493"/>
        <obo:IAO_0000115 xml:lang="en">A rare sex chromosome ... </obo:IAO_0000115>
        ...
        <rdfs:label>48,XXYY syndrome</rdfs:label>
    </owl:Class>

Code snippet

from oaklib.resource import OntologyResource
from oaklib.implementations import ProntoImplementation

ontology = SqlImplementation(OntologyResource(slug='ordo.owl', local=True))
ontology.label('ORDO:10')  # None
ontology.label('http://www.orpha.net/ORDO/Orphanet_10')  # 48,XXYY syndrome

Suggestions

  1. Allow user to pass prefix_map when initializing ontology. This may solve cases where it can't find on CURIE, because it can't expand CURIE properly.
  2. Update ProntoImplementation def type definitions, e.g. label(self, curie: CURIE) -> str: and def _entity(self, curie: CURIE, strict=False): to Union[CURIE, URI]. In my test case, URI worked successfully.
cmungall commented 2 years ago

Thanks @joeflack4! It looks like the problem here has nothing to do with get_labels, the problem is further upstream when connecting to a sqlite database.

joeflack4 commented 2 years ago

@cmungall For 2 of these sub-issues, I agree with you that it is a DB connection issue.

I just added this to the bottom of my issue. This is how I'm initializing my usage of OAK:

from oaklib.resource import OntologyResource
from oaklib.implementations.sqldb.sql_implementation import SqlImplementation

resource = OntologyResource(slug='myOwlOntology.owl', local=True)
si = SqlImplementation(resource)

I just noticed that the front page of the OAK GitHub seems to indicate that I should be passing a sqlite .db file into OAK. I hope this isn't the case. Do I need to pre-create a .db file from OWL first? I hope not. I was under the impression that OAK would do this for me.

If a .db file is required, can we have SqlImplementation raise an error if something other than a .db is passed into the local param? I can make a quick PR for that if you like.

cmungall commented 2 years ago

Thanks, this helps.

We will work on improving the documentation for initialization of SqlImplementation. Yes, the canonical way to do this is to pass in a path to a ready-made sqlite database. See SqlImplementation.

However, you are correct in that if you pass in an OWL file, it will build the sqlite database for you. There are a few caveats here.

I'm sorry, I missed the comment about iterators/generators before. This is indeed independent of the sqlite issues you are facing. The use of iterators is an intentional choice. We have a very preliminary FAQ entry about this, we will expand it.

The reason to use iterators is because OAK is designed for ontologies large and small, for endpoints local and remote, for big queries and small queries. Using an iterator allows the client to start doing useful work on initial results before all results have been computed.

There is indeed a tradeoff here, in that it requires slightly more defensive programming, and it's easy to make mistakes like len(oi.some_method(...)). If you really need to operate on the whole list, just wrap in list(...). But for most operations you can just operate on the iterator as if it were a list, e.g in for loops

I made a separate issue for this #221

joeflack4 commented 2 years ago

Ok, sorry to take up your times w/ the iterators concern. I struck-through that part of the issue.

Hmmm... no I haven't created a SQL DB in my Mondo work / from RDF or OWL yet, and I mostly have access to files in OWL, not RDF/XML. This will require a bit more work on my end. I'll converse with Nico and see how we want to use OAK in light of this.

Just adding again that I think SqlImplementation should raise an error then if it is initialized from a file with extension other than .db or .rdf (or some other means of determining this without looking at the file extension).

Thanks for your very detailed explanations.

cmungall commented 2 years ago

Just adding again that I think SqlImplementation should raise an error then if it is initialized from a file with extension other than .db or .rdf (or some other means of determining this without looking at the file extension).

I partially agree - I think the principle here is that it should raise an error if the input is something that it is unable to handle, and error reporting should be more intuitive. I made a separate issue for this: #223

However, I don't think it's good to unwaveringly assume a relationship between suffix and format/model

So while it's good to guess-with-consent based on suffix, it must be forgiving of going against convention

joeflack4 commented 2 years ago

@cmungall I have updated this issue with another problem case. To make best use of your time, please look at new suggestions at the bottom of the original post.