VirtualFlyBrain / vfb-pipeline-dumps

Pipeline that creates dumps from the triplestore for consumption by the downstream services
Apache License 2.0

Extend obographs-solr.py to load parents and ancestors #36

Open dosumis opened 2 years ago

dosumis commented 2 years ago

STATUS: DRAFT

For the CAP project, we would like to load parents and ancestors to SOLR (storing labels and curies for each node). This needs to be configurable by relation and to allow specification of upper bounds.

obographs-solr.py is the current loader so it would be simplest to just extend this. As this is both a runner script and a collection of functions, args should be shifted to use argparse and new functionality should be driven by optional args. This will ensure that current uses of the script (e.g. in VFB) will remain unaffected.

This script already uses the OBOgraphs JSON format to load labels and synonyms to SOLR. OAK can load these data structures and has an interface that makes it easy to get lists of descendants or ancestors.

Suggested new args:

--add-ancestors {path to file of curies specifying relations to follow - default = subClassOf}

--upper-bounds {path to file of curies specifying upper bounds}
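A minimal sketch of how the two optional args could be wired up with argparse, so that existing invocations keep working when neither flag is passed. The function names `build_parser` and `resolve_relations`, the one-CURIE-per-line file layout, and the use of `rdfs:subClassOf` as the default relation CURIE are assumptions for illustration, not part of the current script.

```python
import argparse

# Assumed default relation when --add-ancestors is given without a file.
DEFAULT_RELATIONS = ["rdfs:subClassOf"]


def build_parser():
    """Build a parser holding only the two proposed optional args."""
    parser = argparse.ArgumentParser(
        description="Load OBOgraphs JSON content to SOLR"
    )
    # Absent -> None (feature off). Present without a value -> "" (use default
    # relation). Present with a value -> path to a file of relation CURIEs.
    parser.add_argument(
        "--add-ancestors",
        nargs="?",
        const="",
        default=None,
        metavar="RELATIONS_FILE",
        help="file of CURIEs for relations to follow (default: subClassOf)",
    )
    parser.add_argument(
        "--upper-bounds",
        default=None,
        metavar="UPPER_BOUNDS_FILE",
        help="file of CURIEs specifying upper bounds",
    )
    return parser


def read_curies(path):
    """Read one CURIE per line, skipping blank lines and '#' comments."""
    with open(path) as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def resolve_relations(value):
    """Map the parsed --add-ancestors value to a list of relations, or None."""
    if value is None:
        return None  # flag not passed: current behaviour is unchanged
    if value == "":
        return list(DEFAULT_RELATIONS)
    return read_curies(value)


# Parsing an explicit list here keeps the example independent of sys.argv.
args = build_parser().parse_args(["--add-ancestors"])
relations = resolve_relations(args.add_ancestors)
```

Keeping `None` as the default for both flags means the new code path is only entered when a caller opts in.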

For each term in the upper bound list, generate a list of descendants (UBD). For each term loaded, generate a list of ancestors. Load the intersection of this list with UBD.
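The steps above can be sketched on a toy graph as follows. This is an illustration only: the edge list, the `EX:` CURIEs and the function names are hypothetical, and the graph is held as plain dictionaries rather than loaded from OBOgraphs JSON. It also assumes "descendants" excludes the upper bound term itself, so the upper bound does not appear in the loaded ancestors; add the bounds to UBD if they should be included.

```python
from collections import defaultdict


def build_indexes(edges):
    """Index (child, parent) edges in both directions."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def upper_bound_descendants(upper_bounds, children):
    """UBD: union of the descendants of every upper bound term."""
    ubd = set()
    for bound in upper_bounds:
        ubd |= reachable(bound, children)
    return ubd


def bounded_ancestors(term, ubd, parents):
    """Ancestors of term that also fall under one of the upper bounds."""
    return reachable(term, parents) & ubd


# Hypothetical subClassOf edges, written as (child, parent).
edges = [
    ("EX:motor_neuron", "EX:neuron"),
    ("EX:neuron", "EX:cell"),
    ("EX:cell", "EX:anatomical_entity"),
    ("EX:anatomical_entity", "EX:thing"),
]
parents, children = build_indexes(edges)
ubd = upper_bound_descendants(["EX:cell"], children)
result = bounded_ancestors("EX:motor_neuron", ubd, parents)
```

Computing UBD once and reusing it for every loaded term avoids repeating the descendant traversal, which is relevant to the scaling concern.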

Potential concerns: Scaling

Possible alternative - just use an ubergraph for queries?

dosumis commented 2 years ago

CC @matentzn - would be useful to get some comment on strategy here.

dosumis commented 2 years ago

Having reviewed the status of OAK dev, we have decided to avoid using it for now. The alternative is to use UberGraph.

You should be able to get ancestors (via subClassOf) from the UberGraph redundant graph, and direct parent classes from the non-redundant graph. Queries should be batched using VALUES for speed. Logic for setting upper bounds is the same as above.
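A rough sketch of batching with VALUES: terms are split into chunks and one SPARQL query is built per chunk. Only the query strings are built here; no request is sent. The graph IRIs, the batch size and the function names are assumptions to check against the UberGraph documentation before use, and label lookup is left out.

```python
# Assumed named graph IRIs for the redundant and non-redundant graphs.
REDUNDANT_GRAPH = "http://reasoner.renci.org/redundant"
NONREDUNDANT_GRAPH = "http://reasoner.renci.org/nonredundant"


def chunked(items, size):
    """Yield successive slices of items with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def subclass_query(iris, graph_iri):
    """Build one query returning subClassOf targets for a batch of terms."""
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?term ?ancestor WHERE {\n"
        f"  VALUES ?term {{ {values} }}\n"
        f"  GRAPH <{graph_iri}> {{ ?term rdfs:subClassOf ?ancestor . }}\n"
        "}"
    )


def batched_queries(iris, graph_iri, batch_size=500):
    """Return one query string per batch of term IRIs."""
    return [subclass_query(batch, graph_iri) for batch in chunked(iris, batch_size)]


# Illustrative term IRIs only.
terms = [
    "http://purl.obolibrary.org/obo/FBbt_00005106",
    "http://purl.obolibrary.org/obo/FBbt_00007002",
    "http://purl.obolibrary.org/obo/FBbt_00005123",
]
ancestor_queries = batched_queries(terms, REDUNDANT_GRAPH, batch_size=2)
parent_queries = batched_queries(terms, NONREDUNDANT_GRAPH, batch_size=2)
```

Passing the redundant graph gives all ancestors and the non-redundant graph gives direct parents, so the same builder covers both cases. The upper bound filter can then be applied to the results in the same way as the intersection described earlier.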

However - I don't think this belongs in VFB as it will involve calling an external service. Code should belong to CAP.