geneontology / neo

noctua entity ontology

Add some minimal QC to the NEO pipeline #89

Open kltm opened 2 years ago

kltm commented 2 years ago

From the software call, we've agreed to add some minimal QC to the NEO pipeline for this project.

@pgaudet @vanaukenk Would you mind providing a handful of example genes to check for when building NEO?

kltm commented 2 years ago

Work at https://github.com/geneontology/neo/pull/90

kltm commented 2 years ago

After some testing, we are unable to create a full data environment within the resource limits of the GHA-hosted runners: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

Pivoting: will treat this more like a local pipeline and run it through Jenkins where I can.

kltm commented 2 years ago

Looking at doing tests in the form of runoak --input neo.owl info GO:0022008 but:

Alternatively, we could write a direct test script with oaklib and a test list. Probably still easier to make sure that lib is in ODK and run in there.

Stub added to pipeline.

kltm commented 2 years ago

It looks like there is action to get oaklib into odk, which would be convenient. Also, I think I can split the neo build into three steps (build, test, publish), which should help with the software compatibility.

kltm commented 2 years ago

Candidate in testing here: https://github.com/INCATools/ontology-development-kit/pull/586

pgaudet commented 2 years ago

The goal of the QC will be to test whether a number of test IDs are present at each load.
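A minimal sketch of that kind of presence check, assuming the pipeline can expose some lookup that returns a label (or nothing) for an ID. The function name `find_missing_ids` and the callable-based design are illustrative assumptions, not part of the NEO pipeline; in practice the lookup might be something like oaklib's `label` method on a loaded NEO.

```python
from typing import Callable, Iterable, List, Optional


def find_missing_ids(
    test_ids: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> List[str]:
    """Return the test IDs for which `lookup` finds nothing.

    `lookup` is any callable mapping an ID to a label, or to None/"" when
    the ID is absent. It is a hypothetical hook, not an existing NEO API.
    """
    return [curie for curie in test_ids if not lookup(curie)]


# Toy stand-in for a loaded NEO: a dict from ID to label.
toy_labels = {"PR:000000001": "protein"}

# "EXAMPLE:0000000" is a made-up ID used only to show a failing check.
missing = find_missing_ids(["PR:000000001", "EXAMPLE:0000000"], toy_labels.get)
print(missing)
```

A build step could then fail the load whenever the returned list is non-empty, with the test list kept in a plain file next to the pipeline.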

kltm commented 2 years ago

Suggestion from @cmungall to use sqlite backend.

kltm commented 2 years ago

For the tooling we want, we expect it to be added to a versioned public ODK release around June 1st (https://github.com/INCATools/ontology-development-kit/milestone/5).

kltm commented 2 years ago

Noting that runoak has been added to the ODK at v1.3.1, but does not seem to be functional. E.g.:

docker run --network host -it obolibrary/odkfull:v1.3.1 /bin/bash
root@moiraine:/tmp# runoak --input go-base.owl info GO:0022008
OpenBLAS blas_thread_init: pthread_create failed for thread 1 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 2 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 3 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 4 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 5 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 6 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 7 of 8: Operation not permitted
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
Traceback (most recent call last):
  File "/usr/local/bin/runoak", line 5, in <module>
    from oaklib.cli import main
  File "/usr/local/lib/python3.10/dist-packages/oaklib/__init__.py", line 7, in <module>
    from oaklib.interfaces import BasicOntologyInterface
  File "/usr/local/lib/python3.10/dist-packages/oaklib/interfaces/__init__.py", line 6, in <module>
    from oaklib.interfaces.mapping_provider_interface import MappingProviderInterface
  File "/usr/local/lib/python3.10/dist-packages/oaklib/interfaces/mapping_provider_interface.py", line 6, in <module>
    import sssom
  File "/usr/local/lib/python3.10/dist-packages/sssom/__init__.py", line 5, in <module>
    from .util import (  # noqa:401
  File "/usr/local/lib/python3.10/dist-packages/sssom/util.py", line 28, in <module>
    import numpy as np
  File "/usr/local/lib/python3.10/dist-packages/numpy/__init__.py", line 144, in <module>
    from . import core
  File "/usr/local/lib/python3.10/dist-packages/numpy/core/__init__.py", line 23, in <module>
    from . import multiarray
  File "/usr/local/lib/python3.10/dist-packages/numpy/core/multiarray.py", line 10, in <module>
    from . import overrides
  File "/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py", line 6, in <module>
    from numpy.core._multiarray_umath import (
KeyboardInterrupt

kltm commented 2 years ago

Hm, making progress after a little update:

Unpacking containerd.io (1.6.6-1) over (1.4.9-1) ...
Preparing to unpack .../docker-ce-cli_5%3a20.10.17~3-0~ubuntu-bionic_amd64.deb ...
Unpacking docker-ce-cli (5:20.10.17~3-0~ubuntu-bionic) over (5:20.10.8~3-0~ubuntu-bionic) ...
Preparing to unpack .../docker-ce_5%3a20.10.17~3-0~ubuntu-bionic_amd64.deb ...
Unpacking docker-ce (5:20.10.17~3-0~ubuntu-bionic) over (5:20.10.8~3-0~ubuntu-bionic) ...

kltm commented 2 years ago

@pgaudet I think we're unlikely to get to this soon: https://github.com/berkeleybop/bbops/issues/26 We're going to have to do updates as maintenance anyway, so I think there is little point in holding up the closing of the project over this. I'd vote to pull it from this project and close the project out.

kltm commented 2 years ago

Trying:

root@moiraine:/tmp# semsql make /tmp/neo.db

Flames out with:

java.lang.OutOfMemoryError: Java heap space
**** WARNING ***
Catastrophic JVM error encountered. Application not safely interrupted. Resources may be leaked. Check the logs for more details and consider overriding `Platform.reportFatal` to capture context.
make: *** [/usr/local/lib/python3.10/dist-packages/semsql/builder/build.Makefile:65: /tmp/neo-relation-graph.tsv] Error 255

Will follow up later on. Looking at the Makefile, I'm not sure how to pass parameters through or what magnitude of heap might be needed/expected.

kltm commented 2 years ago

JAVA_OPTS=-Xmx12G JAVA_ARGS=-Xmx12G semsql make /tmp/neo.db works to get the heap setting through (not sure which of the two variables does it). The build completed after a while, so 12G of heap is sufficient for this where the default was not.
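If the build ends up driven from a script rather than typed by hand, the same invocation could be sketched as below. The helper name `semsql_build_invocation` is hypothetical; the command and the two environment variables are exactly the ones from the comment above, and both variables are set because it is not known which one takes effect.

```python
import os
from typing import Dict, List, Tuple


def semsql_build_invocation(
    db_path: str, heap: str = "12G"
) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv and environment for `semsql make <db_path>`.

    Both JAVA_OPTS and JAVA_ARGS get the same -Xmx value, mirroring the
    manual run; which one is actually honoured is an open question.
    """
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env["JAVA_OPTS"] = f"-Xmx{heap}"
    env["JAVA_ARGS"] = f"-Xmx{heap}"
    return ["semsql", "make", db_path], env


cmd, env = semsql_build_invocation("/tmp/neo.db")
print(" ".join(cmd), env["JAVA_OPTS"], env["JAVA_ARGS"])

# To actually run the build (needs semsql installed and plenty of memory):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Keeping the heap size as a parameter makes it easy to adjust once the real requirement for NEO is better understood.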

With that though, runoak --input /tmp/neo.db info GO:0022008 runs super zippy with no fuss (cheers @cmungall). Still blocked by https://github.com/berkeleybop/bbops/issues/26 for production; will likely try with the lib and a script rather than trying to fix up the CLI.

runoak --input sqlite:/tmp/neo.db info PR:000000001
PR:000000001 ! protein

Not always getting expected results: often not getting info or search results for things I know are in there. For example, runoak --input sqlite:/tmp/neo.db descendants GO:0098015 does not return results.
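One way to tell "the term is not in the database" apart from "the term is there but has nothing under it" is to look at the SQLite file directly. The sketch below assumes the semsql-built database has a `statements` table with a `subject` column and an `entailed_edge` table with `subject`, `predicate` and `object` columns; that layout is an assumption about the semsql schema and should be checked against the actual file. The function name `term_counts` is hypothetical.

```python
import sqlite3
from typing import Dict


def term_counts(conn: sqlite3.Connection, curie: str) -> Dict[str, int]:
    """Count how often `curie` appears as a subject and how many distinct
    other terms have an entailed edge pointing at it.

    Table and column names are assumed from the semsql layout.
    """
    as_subject = conn.execute(
        "SELECT COUNT(*) FROM statements WHERE subject = ?", (curie,)
    ).fetchone()[0]
    descendants = conn.execute(
        "SELECT COUNT(DISTINCT subject) FROM entailed_edge "
        "WHERE object = ? AND subject != ?",
        (curie, curie),
    ).fetchone()[0]
    return {
        "statements_as_subject": as_subject,
        "entailed_descendants": descendants,
    }

# Possible use against the real file (not executed here):
#   conn = sqlite3.connect("/tmp/neo.db")
#   print(term_counts(conn, "GO:0098015"))
```

Zero in both counts would suggest the term was never loaded, whereas a non-zero subject count with zero descendants would point at the term simply having nothing beneath it in this file.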

kltm commented 2 years ago

Noting that for solr I load both go-lego.owl and neo.owl, so naturally GO terms can't be queried when only NEO is loaded.