Knowledge-Graph-Hub / knowledge-graph-hub.github.io

https://kghub.org
BSD 3-Clause "New" or "Revised" License

Hang while loading schema from biolink-model #18

Closed: caufieldjh closed this issue 1 month ago

caufieldjh commented 2 years ago

All runs, Jenkins or otherwise, hang after this point:

10:39:23  + python make_kg_manifest.py --bucket kg-hub-public-data --outpath MANIFEST.yaml --maximum 10
10:39:25  Retrieving OBO metadata from https://raw.githubusercontent.com/OBOFoundry/OBOFoundry.github.io/master/registry/ontologies.yml...
10:39:28  Found credentials in environment variables.
10:39:28  Searching kg-hub-public-data...
10:40:25  Bucket kg-hub-public-data contains 137386 objects.
10:40:25  Found 528 new compressed graph files.
10:40:25  Found 2046 new uncompressed graph files.
10:40:25  Will consider only 10 files in total.
10:40:25  Will process 10 new compressed graph files.
10:40:25  Will process 0 new uncompressed graph files.
10:40:25  No updates for kg-idg.
10:40:25  Validating new builds for kg-covid-19...
10:40:25  Retrieving kg-covid-19/20200925/kg-covid-19.tar.gz...
10:40:46  Validating graph files with KGX...
10:40:47  biocontext map idot_context has illegal prefix: 2D-PAGE.PROTEIN
10:40:47  biocontext map idot_context has illegal prefix: 3DMET
10:40:47  biocontext map idot_context has illegal prefix: MMMP:BIOMAPS
10:40:49  class "organism taxon" slot "has taxonomic rank" does not reference an existing slot.  New slot was created.
10:40:53  biocontext map idot_context has illegal prefix: 2D-PAGE.PROTEIN
10:40:53  biocontext map idot_context has illegal prefix: 3DMET
10:40:53  biocontext map idot_context has illegal prefix: MMMP:BIOMAPS
10:40:53  Loading schema https://w3id.org/linkml/types from https://raw.githubusercontent.com/biolink/biolink-model/2.2.13/biolink-model.yaml
...
[an indeterminate but excessive amount of time passes, during which nothing happens]

Maybe a biolink-model update would help?

Originally posted by @caufieldjh in https://github.com/Knowledge-Graph-Hub/knowledge-graph-hub.github.io/issues/16#issuecomment-1066891226

caufieldjh commented 2 years ago

The exact point this happens is here: https://github.com/Knowledge-Graph-Hub/knowledge-graph-hub.github.io/blob/e0f680dc16b5ee89816de63b2f208e2b7301bf7a/utils/make_kg_manifest.py#L243-L254

I haven't been able to pin down exactly where kgx retrieves the schema, but there is a new version (1.5.6, vs. the 1.5.5 used here) so I'll try that first.
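One way to narrow down where the process is stuck, sketched here under assumptions rather than taken from the repository, is the standard-library `faulthandler` module, which can dump every thread's traceback if the process is still running after a set time. The helper name `hang_watchdog` and the chosen interval are hypothetical.

```python
import contextlib
import faulthandler
import sys
from typing import Iterator, Optional, TextIO


@contextlib.contextmanager
def hang_watchdog(seconds: float, file: Optional[TextIO] = None) -> Iterator[None]:
    """Dump all thread tracebacks every `seconds` while the wrapped block runs.

    `file` must be a real file object with a file descriptor; it defaults to
    sys.stderr. The dumps stop as soon as the block exits.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(seconds, repeat=True, file=target)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the slow step (the call name is a placeholder,
# not the actual function in make_kg_manifest.py):
#
# with hang_watchdog(300):
#     run_validation()
```

If the traceback repeated every few minutes keeps pointing at the same frame, that frame is a reasonable candidate for the stall, whether it is a network fetch or a long-running computation.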

caufieldjh commented 2 years ago

Bumping kgx to 1.5.6 does not appear to solve this. Running kgx validate from the command line in a fresh venv on a smaller graph (I tried https://kg-hub.berkeleybop.io/kg-obo/obcs/2018-02-22/obcs_kgx_tsv.tar.gz) does appear to work as expected. Running kgx validate locally on a larger KG also seems to stall for a moment, but it completes.

caufieldjh commented 2 years ago

The specific biolink-model version to use is defined in biolink-model-toolkit: https://github.com/biolink/biolink-model-toolkit/blob/master/bmt/toolkit.py

caufieldjh commented 2 years ago

This could also be an issue with KG-COVID-19. When I do this locally:

$ wget https://kg-hub.berkeleybop.io/kg-covid-19/20200925/kg-covid-19.tar.gz
...
$ kgx validate -i 'tsv' -c 'tar.gz' -o temp-test-kgcovid19 kg-covid-19.tar.gz

kgx seems to hang (i.e., it still hasn't started node validation after >10 min). This happens with the most recent version of KG-COVID-19, too.
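To keep a run like this from blocking indefinitely, the command can be wrapped with a time limit. This is a minimal sketch using only `subprocess` from the standard library; the helper name `run_with_timeout` and the 30-minute limit are assumptions for illustration, not part of kgx or of this repository.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[int]:
    """Run `cmd` and return its exit code, or None if it exceeded `timeout_s`.

    subprocess.run kills the child process when the timeout expires, so a
    stalled validation does not keep the calling job alive.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Hypothetical usage with the command from this comment:
#
# cmd = ["kgx", "validate", "-i", "tsv", "-c", "tar.gz",
#        "-o", "temp-test-kgcovid19", "kg-covid-19.tar.gz"]
# if run_with_timeout(cmd, 30 * 60) is None:
#     print("kgx validate did not finish within the time limit")
```

This does not explain the stall, but it would let a Jenkins run skip a problematic build and move on to the next one.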

caufieldjh commented 2 years ago

On the last successful build, it looks like kgx required ~16 min for KG-IDG between schema loading and the start of node validation:

[2022-03-02T16:04:10.369Z] Loading schema https://w3id.org/linkml/types from https://raw.githubusercontent.com/biolink/biolink-model/2.2.13/biolink-model.yaml
[2022-03-02T16:20:31.627Z] Validating nodes in graph
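The gap between those two log lines can be checked directly from the timestamps. The snippet below is a small sketch using the standard-library `datetime` module; `parse_ts` is a hypothetical helper name.

```python
from datetime import datetime


def parse_ts(stamp: str) -> datetime:
    """Parse a Jenkins-style timestamp such as 2022-03-02T16:04:10.369Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


schema_loaded = parse_ts("2022-03-02T16:04:10.369Z")
nodes_started = parse_ts("2022-03-02T16:20:31.627Z")

elapsed = nodes_started - schema_loaded
print(f"{elapsed.total_seconds() / 60:.1f} minutes between the two log lines")
```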

This is also the case when run locally as

kgx validate -i 'tsv' -c 'tar.gz' -o temp-test-kgidg KG-IDG.tar.gz

KG-IDG is smaller in size than KG-COVID-19 (205.43M vs 787.17M compressed) but not enough that I'd expect the former to take 16 min and the latter to take days.
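As a rough sanity check on that expectation, the sizes and the ~16 min figure above can be extrapolated under two assumed scaling behaviours. This is only a back-of-the-envelope sketch: compressed size is an imperfect proxy for node and edge counts, "M" is taken to mean megabytes, and neither linear nor quadratic scaling is known to describe kgx.

```python
IDG_MB = 205.43      # KG-IDG compressed size, from the comment above
COVID_MB = 787.17    # KG-COVID-19 compressed size, from the comment above
IDG_MINUTES = 16.0   # approximate gap observed for KG-IDG

ratio = COVID_MB / IDG_MB

# Extrapolate the KG-IDG gap to KG-COVID-19 under each assumed scaling.
linear_minutes = IDG_MINUTES * ratio
quadratic_minutes = IDG_MINUTES * ratio ** 2

print(f"size ratio: {ratio:.2f}x")
print(f"linear estimate: {linear_minutes:.0f} min")
print(f"quadratic estimate: {quadratic_minutes / 60:.1f} h")
```

Under these assumptions, even quadratic scaling would predict a wait of hours rather than days, which is consistent with the suspicion that something other than graph size is involved.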