Transforms of `agro`, `envo`, and `gaz` fail during post-processing

Knowledge-Graph-Hub / kg-obo

A package to transform all OBO ontologies into KGX TSV format and OBO json, and put the transformed graph in KGhub

https://knowledge-graph-hub.github.io/kg-obo/getting_started.html

GNU General Public License v3.0


Transforms of `agro`, `envo`, and `gaz` fail during post-processing #186

Open caufieldjh opened 1 year ago

caufieldjh commented 1 year ago

Describe the bug

The agro transform appears to proceed as expected until it reaches post-processing:

Transforming agro to tsv...
[KGX][cli_utils.py][    transform_source] INFO: Processing source 'agro.json'
INFO:kg-obo:No errors in parsing ['data/agro/2021-11-05/agro.json'].
Post-processing agro...
INFO:kg-obo:Post-processing agro...
Failed to remap node IDs - could not find corresponding nodes.
Failed post-processing agro...
INFO:kg-obo:Failed post-processing agro...
WARNING:kg-obo:Failed to transform agro

To Reproduce

python run.py --bucket kg-hub-public-data --save_local --get_only agro

Expected behavior

Post-processing for this OBO should update 4 CURIEs and write out the updated nodes file.

Version

efc2324f040d8daad14ffaaaa6e71583d6258117

caufieldjh commented 1 year ago

A clue: the CURIEs to be updated are all Wikidata URLs and should get the prefix WIKIDATA:, but they get the prefix WD_Entity: instead. Bioregistry knows about that alternate prefix, but it isn't in the imported maps.

caufieldjh commented 1 year ago

The post-processing fails because KG-OBO finds prefixes it wants to rewrite and writes them to update_id_maps.tsv, but then finds that the nodefile doesn't contain any of those nodes, since they have already been converted to WD_Entity:.
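To make the failure mode concrete, here is a minimal sketch of a remap step that looks up each ID from the update map in the nodefile. It is not the actual KG-OBO code: the function name `remap_node_ids`, the node IDs, and the shape of the map (original ID to new CURIE) are all assumptions for illustration.

```python
def remap_node_ids(node_ids, id_map):
    """Apply id_map (old ID -> new ID) to node_ids.

    Returns the updated list plus the old IDs that were never found
    among the nodes. Hypothetical helper, not KG-OBO's implementation.
    """
    present = set(node_ids)
    missing = [old for old in id_map if old not in present]
    updated = [id_map.get(node_id, node_id) for node_id in node_ids]
    return updated, missing


# Hypothetical IDs: the map was built from the original Wikidata URL,
# but the nodefile already holds the WD_Entity: form.
nodes = ["WD_Entity:Q1", "AGRO:00000001"]
id_map = {"https://www.wikidata.org/wiki/Q1": "WIKIDATA:Q1"}

updated, missing = remap_node_ids(nodes, id_map)
print(updated)  # the nodes are unchanged because no key matched
print(missing)  # every entry in the map is reported as not found
```

Under these assumptions every lookup misses, which would match the "could not find corresponding nodes" message in the log above.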

caufieldjh commented 1 year ago

This conversion is being done by kgx. Transforming the obojson version directly also yields WD_Entity nodes:

kgx transform -i obojson -f tsv -o agro_test agro.json

This is true for both kgx 1.5.9 and 1.7.0.

caufieldjh commented 1 year ago

So kgx is probably using the prefixcommons Monarch map: https://github.com/prefixcommons/prefixcommons-py/blob/master/prefixcommons/registry/monarch_context.jsonld#L151
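As an illustration of why the choice of prefix map matters, the sketch below contracts the same IRI against two different maps. It is plain Python and does not call kgx or prefixcommons; the `contract` helper, the namespace IRI, and the Q-number are assumptions for illustration, not values copied from the linked context file.

```python
def contract(iri, prefix_map):
    """Contract an IRI to a CURIE using the longest matching namespace.

    Returns the IRI unchanged if no namespace in prefix_map matches.
    Hypothetical helper; kgx's own prefix manager may behave differently.
    """
    best_prefix, best_ns = None, ""
    for prefix, namespace in prefix_map.items():
        if iri.startswith(namespace) and len(namespace) > len(best_ns):
            best_prefix, best_ns = prefix, namespace
    if best_prefix is None:
        return iri
    return f"{best_prefix}:{iri[len(best_ns):]}"


# Assumed namespace; both maps bind it, but under different prefixes.
WIKIDATA_NS = "https://www.wikidata.org/wiki/"
monarch_like_map = {"WD_Entity": WIKIDATA_NS}
desired_map = {"WIKIDATA": WIKIDATA_NS}

iri = WIKIDATA_NS + "Q1"  # hypothetical entity
print(contract(iri, monarch_like_map))
print(contract(iri, desired_map))
```

If kgx applies a map like the first one during the transform, the later post-processing step never sees the original URL or a WIKIDATA: CURIE.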

caufieldjh commented 1 year ago

Essentially we need to deactivate the prefix maps handled by the kgx prefix manager (https://kgx.readthedocs.io/en/latest/reference/prefix_manager.html).

caufieldjh commented 1 year ago

The transform of envo has a nearly identical issue.

caufieldjh commented 1 year ago

Same with gaz.

caufieldjh commented 1 year ago

xco has a potentially related issue, though with MESH.

caufieldjh commented 1 year ago

The main workaround here is to be less stringent about incomplete mappings. Right now, if we attempt to remap 2 nodes and both fail, we consider the whole transform failed, but if just 1 fails, it passes. The priority should be on having a transform; there may be 10000x as many correctly prefixed nodes in there.
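A rough sketch of what that more lenient policy could look like, next to the behavior described above. The function name `postprocess_ok` and its parameters are hypothetical; only the "kg-obo" logger name is taken from the log output in this issue.

```python
import logging

logger = logging.getLogger("kg-obo")


def postprocess_ok(attempted, failed, lenient=True):
    """Decide whether post-processing should count as successful.

    attempted: number of node IDs we tried to remap
    failed: number of those that could not be found
    Hypothetical helper, not the current KG-OBO code.
    """
    if failed < 0 or failed > attempted:
        raise ValueError("failed must be between 0 and attempted")
    if failed == 0:
        return True
    if lenient:
        # Keep the transform and record the incomplete mapping.
        logger.warning("Could not remap %d of %d node IDs", failed, attempted)
        return True
    # Strict mode, as described above: fail only if every remap failed.
    return failed < attempted


print(postprocess_ok(2, 2, lenient=False))
print(postprocess_ok(2, 1, lenient=False))
print(postprocess_ok(2, 2))
```

With `lenient=True`, a transform whose remaps all fail would still be written out, with a warning in the log instead of a failed transform.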