frink-okn / FRINKIssues

Design Query Strategy #11

Open cbizon opened 3 months ago

cbizon commented 3 months ago

We want to implement transparent (or at least translucent) querying. How do we handle that at an identifier level?

cbizon commented 2 months ago

Here is a strawman proposal:

Leave the graphs in their original formats.

Create an equivalence graph to manage equivalent identifiers, and integrate it into SPARQL queries. To start, the use of this graph in queries will be explicit, but it will move towards being implicit.

So where does the equivalence graph come from? I think we should have multiple ones:

  1. A graph based on hard work. This means sitting down with individual graphs/teams to solve, e.g., the Person problem or the Place problem, or building mappings from various entity types to Wikidata identifiers.
  2. A graph based on semantic embeddings of labels. I don't expect this to be all that good, but it's probably not all that bad either, and it's easy.
  3. A graph based on graph embeddings. The technical issue here is being able to create the graph embeddings in a federated way. But it should improve on number 2.

Then we can swap these in and out even at query time.
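One way to picture swapping equivalence graphs at query time is to keep each one in its own named graph and pick the graph IRI when the query is built. The sketch below is only an illustration under assumptions: the `https://example.org/equiv/...` graph IRIs, the strategy names, and the `build_query` helper are all hypothetical, and it assumes equivalences are stored as `owl:sameAs` triples pointing from the queried entity to its equivalents.

```python
# Hypothetical named-graph IRIs, one per proposed equivalence graph.
EQUIVALENCE_GRAPHS = {
    "curated": "https://example.org/equiv/curated",
    "label-embedding": "https://example.org/equiv/label-embedding",
    "graph-embedding": "https://example.org/equiv/graph-embedding",
}

# Doubled braces are literal braces in the SPARQL text after str.format().
QUERY_TEMPLATE = """\
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?equivalent ?p ?o WHERE {{
  GRAPH <{equiv_graph}> {{ <{entity}> owl:sameAs ?equivalent . }}
  ?equivalent ?p ?o .
}}"""


def build_query(entity_iri: str, strategy: str) -> str:
    """Return a SPARQL query that expands entity_iri through one equivalence graph."""
    try:
        graph_iri = EQUIVALENCE_GRAPHS[strategy]
    except KeyError:
        raise ValueError(f"unknown equivalence strategy: {strategy!r}") from None
    return QUERY_TEMPLATE.format(equiv_graph=graph_iri, entity=entity_iri)


if __name__ == "__main__":
    print(build_query("http://www.wikidata.org/entity/Q1192302", "curated"))
```

Changing the `strategy` argument is the only thing needed to run the same question against a different equivalence graph. Whether the second triple pattern matches data outside the named graph depends on how the endpoint's default graph is configured, and the sketch follows `owl:sameAs` in one direction only.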

For the embedding-based ones, another backend approach would be to not materialize these graphs but to keep vector databases around. Even so, it would be good to lay a SPARQL interface over them. (RDF*-based, to handle a variable nearest-neighbor threshold?)
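To make the variable-threshold idea concrete, here is a minimal sketch of deriving candidate equivalences from embeddings with an adjustable similarity cutoff. It is not a vector database: a brute-force pairwise cosine comparison stands in for the nearest-neighbor lookup, and the `ex:` identifiers and three-dimensional vectors are made-up toy data.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def equivalence_pairs(embeddings, threshold):
    """Yield (iri_a, iri_b, score) for every pair whose similarity meets threshold."""
    for (a, va), (b, vb) in combinations(sorted(embeddings.items()), 2):
        score = cosine(va, vb)
        if score >= threshold:
            yield a, b, score


# Toy data: hypothetical identifiers with invented embedding vectors.
toy = {
    "ex:graphA/Durham": [0.9, 0.1, 0.0],
    "ex:graphB/DurhamNC": [0.8, 0.2, 0.1],
    "ex:graphB/Seattle": [0.0, 0.1, 0.9],
}

strict = list(equivalence_pairs(toy, threshold=0.95))
loose = list(equivalence_pairs(toy, threshold=0.10))

if __name__ == "__main__":
    for a, b, score in strict:
        print(f"{a} ~ {b} ({score:.3f})")
```

Keeping the score with each pair is what would let a SPARQL layer filter on the threshold at query time, for example by attaching the score as an annotation on the equivalence statement.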

mahir256 commented 1 week ago

I believe that option 1 will be ultimately better across graphs in this case, though it need not be built entirely manually for all graphs.

For some graphs, particularly in the biology and environment groups, the use of identifiers from external sources should make mapping to Wikidata entities easier when those same identifiers are also present in Wikidata (subject to the modifications to the Wikidata dump noted below).

For other graphs, particularly in the justice and technology groups, while such external identifiers may not be present, the substitution of a small number of custom entities (such as for frequently occurring jurisdictions and people) with Wikidata identifiers can make accessing many other entities within those graphs easier. There are also some entities within those graphs that may be worth introducing to Wikidata specifically for advancing this access, such as hardware types and vendors from Secure Chain, law firms from SCALES, industry types from SAWGraph, or manufacturing capabilities from SUDOKN.

The Wikidata dump does have mappings from item IDs to the IRIs used in RDF to represent certain identifiers, but the links do not all use the same RDF predicate: for an item like Q1192302, the link to UniProt appears as "wd:Q1192302 wdtn:P685 <http://purl.uniprot.org/taxonomy/38891>", while on Q14911732 the link to UniProt appears as "wd:Q14911732 wdtn:P351 <http://purl.uniprot.org/geneid/1017>". The predicates beginning with "wdtn:" would thus all need to be rewritten to "owl:sameAs" across the dump.
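As a rough sketch of that predicate rewrite, the function below processes one N-Triples line at a time and replaces any `wdtn:` predicate with `owl:sameAs`. It assumes the dump is in N-Triples form with full IRIs, where `wdtn:` expands to `http://www.wikidata.org/prop/direct-normalized/`; the `rewrite_line` helper name is hypothetical.

```python
import re

# Assumed expansion of the wdtn: prefix in the Wikidata RDF dump.
WDTN = "http://www.wikidata.org/prop/direct-normalized/"
OWL_SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"

# Match a wdtn: property IRI only in predicate position, i.e. directly
# after the subject IRI at the start of the line.
_PREDICATE = re.compile(r"^(<[^>]*>\s+)<" + re.escape(WDTN) + r"P\d+>(\s+)")


def rewrite_line(line: str) -> str:
    """Replace a wdtn: predicate with owl:sameAs; leave other lines unchanged."""
    return _PREDICATE.sub(
        lambda m: m.group(1) + OWL_SAME_AS + m.group(2), line, count=1
    )


if __name__ == "__main__":
    example = (
        "<http://www.wikidata.org/entity/Q1192302> "
        "<http://www.wikidata.org/prop/direct-normalized/P685> "
        "<http://purl.uniprot.org/taxonomy/38891> ."
    )
    print(rewrite_line(example))
```

Because each line is handled independently, the same function can be streamed over a compressed dump without loading it into memory. Triples with any other predicate pass through untouched.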

Even if those graphs with identifiers do not expose them using the same IRI schemes or with "owl:sameAs", these could still be automatically transformed appropriately on a graph-by-graph basis.