frostyfan109 / tranql

A Translator Query Language
https://researchsoftwareinstitute.github.io/data-translator/apps/tranql
MIT License

Nodes of equivalent identifiers are currently treated as separate nodes. #40

frostyfan109 closed 5 years ago

frostyfan109 commented 5 years ago

This should be fairly easy to fix because we already have the functionality in resolve_name. The issue is time complexity: we would need to query the Bionames API for every node in every response, which adds up to a very large number of requests. Doing so would let us check whether any node in the current knowledge graph shares an equivalent identifier with any other node. This would all take place in merge_results.

However, some APIs already return equivalent identifiers, which makes this semi-feasible for now. We only need to query Bionames for the nodes that do not already have them.
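To make the merge step concrete, here is a minimal sketch of how nodes with overlapping identifiers could be collapsed. It is not the actual merge_results code: the function name `merge_equivalent_nodes`, the dict-shaped nodes, the `lookup` callable (a stand-in for a Bionames query) and the placeholder identifiers are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional

Node = Dict[str, Any]


def merge_equivalent_nodes(
    nodes: List[Node],
    lookup: Optional[Callable[[str], List[str]]] = None,
) -> List[Node]:
    """Merge nodes whose identifier sets overlap.

    `lookup` is a hypothetical stand-in for a Bionames query. It is only
    called for nodes that do not already carry `equivalent_identifiers`.
    """
    merged: List[Node] = []
    for node in nodes:
        ids = set(node.get("equivalent_identifiers") or [])
        if not ids and lookup is not None:
            ids = set(lookup(node["id"]))
        ids.add(node["id"])

        # Split the nodes merged so far into those sharing an identifier
        # with the current node and those that do not.
        keep: List[Node] = []
        overlapping: List[Node] = []
        for existing in merged:
            shares = ids & set(existing["equivalent_identifiers"])
            (overlapping if shares else keep).append(existing)

        for existing in overlapping:
            ids |= set(existing["equivalent_identifiers"])

        # The earliest node wins; a real merge would also combine properties.
        base = overlapping[0] if overlapping else node
        keep.append({**base, "equivalent_identifiers": sorted(ids)})
        merged = keep
    return merged


# Placeholder identifiers, not real CURIEs.
example = merge_equivalent_nodes([
    {"id": "A:1", "equivalent_identifiers": ["A:1", "B:1"]},
    {"id": "B:1"},
    {"id": "C:1"},
])
```

Passing `lookup=None` means only identifiers already present on the nodes are compared, so no requests are made at all.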

Another possibility is to add something like a cache to execute_plan. This would let us cache the Bionames queries we have already made and avoid requesting them again.
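A rough sketch of what such a cache might look like, assuming the Bionames call can be wrapped as a plain function from an identifier to a list of identifiers. `CachedLookup` and `fetch` are hypothetical names, not part of TranQL or Bionames.

```python
from typing import Callable, Dict, List


class CachedLookup:
    """Memoize a lookup function so each identifier is fetched at most once."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        self._fetch = fetch  # hypothetical wrapper around the Bionames request
        self._cache: Dict[str, List[str]] = {}
        self.misses = 0  # number of times the underlying fetch actually ran

    def __call__(self, identifier: str) -> List[str]:
        if identifier not in self._cache:
            self.misses += 1
            self._cache[identifier] = list(self._fetch(identifier))
        return self._cache[identifier]


# Usage with a fake fetch function standing in for the real request.
cached = CachedLookup(lambda identifier: [identifier, identifier.lower()])
cached("A:1")
cached("A:1")  # served from the cache, no second fetch
```

Keeping the cache on an object rather than in a module-level dict would make it easy to scope it to a single execute_plan run, if that turns out to be the right lifetime.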

frostyfan109 commented 5 years ago

This has now been added. It does not yet use the caching idea. However, once Bionames is queried for a given node object, the identifiers are stored on that node as the persistent property equivalent_identifiers, so Bionames will not have to be queried again for that specific node. Nevertheless, it would still be advantageous to build a map of equivalent identifiers, so that if any of those identifiers comes up again during the query, Bionames does not need to be queried another time. Additionally, this map could perhaps persist across TranQL instances (i.e. stored in a file).
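One way the map idea could look, as a sketch only: every identifier in a group points to the whole group, and the map can be written to and read from a JSON file. The class name `EquivalenceMap`, its methods and the file format are assumptions, not existing TranQL code.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Optional


class EquivalenceMap:
    """Map each identifier to the sorted list of identifiers equivalent to it."""

    def __init__(self) -> None:
        self._groups: Dict[str, List[str]] = {}

    def add(self, identifiers: Iterable[str]) -> None:
        group = set(identifiers)
        # Pull in any group already recorded for one of these identifiers.
        for identifier in list(group):
            group.update(self._groups.get(identifier, []))
        ordered = sorted(group)
        for identifier in ordered:
            self._groups[identifier] = ordered

    def get(self, identifier: str) -> Optional[List[str]]:
        """Return the known group, or None if a lookup is still needed."""
        return self._groups.get(identifier)

    def save(self, path: str) -> None:
        Path(path).write_text(json.dumps(self._groups))

    @classmethod
    def load(cls, path: str) -> "EquivalenceMap":
        instance = cls()
        file = Path(path)
        if file.exists():
            instance._groups = json.loads(file.read_text())
        return instance
```

With this shape, the merge step could call `get` first and only fall back to Bionames when it returns `None`, then `add` the result so later nodes with any of those identifiers skip the request.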

frostyfan109 commented 5 years ago

Workflow 5 v3 simply doesn't work at the moment because of the sheer number of requests sent to Bionames: the query crashes after flooding the Bionames API. Implementing a cache map is therefore important.

frostyfan109 commented 5 years ago

For the time being, the use of Bionames in merge_results has been sandboxed and disabled. merge_results still checks a node's equivalent_identifiers, but it does not actively look up these identifiers using Bionames.