comunica / comunica-feature-link-traversal

📬 Comunica packages for link traversal-based query execution

Slash separated ontologies cause unneeded traversals when used as source #61

Open Maximvdw opened 2 years ago

Maximvdw commented 2 years ago

Description:

This issue is about a performance problem with slash-separated ontologies. Let's say you are using several subjects from the same ontology, such as http://example.com/myontology/A and http://example.com/myontology/B. Currently they are treated as individual datasets and are traversed individually (as they should be). With a normal linked data front-end this works fine, and it fetches only these concepts rather than a large dataset that might contain unneeded information.

In some use cases you might be using a lot of concepts from the same ontology, in which case one request to http://example.com/myontology/ would be preferable.

When putting this ontology in the sources, I would expect only one request to be made. However, it seems Comunica still fetches the subjects individually, creating a separate request for every subject in http://example.com/myontology/.

I think it is similar to the 'similarity' prioritisation in https://github.com/comunica/comunica-feature-link-traversal/issues/51 , however I was not certain it is the exact issue that appears here.

Try it out: https://comunica.github.io/comunica-feature-link-traversal-web-clients/builds/default/
Use the following source (http): http://qudt.org/vocab/unit/
Enable the proxy (I also tested it without the proxy): https://proxy.linkeddatafragments.org/
Make sure it is HTTPS and not the default HTTP.

Test query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qudt: <http://qudt.org/schema/qudt/>
SELECT ?unitName WHERE {
    ?unit a qudt:Unit ;
          rdfs:label ?unitName .
}

For each concept that is available in the source, it still creates a request to the individual page (which, for this ontology, adds up to a lot of requests).


Environment:

I am using the default config along with "@comunica/query-sparql-link-traversal-solid": "0.0.2-alpha.4.0",

github-actions[bot] commented 2 years ago

Thanks for reporting!

Maximvdw commented 2 years ago

Hmm, I am not sure whether actor-rdf-resolve-hypermedia-links-traverse-prune-shapetrees is intended to solve this issue, for example with a fictional new ShapeTree(undefined, undefined, 'http://qudt.org/unit/{id}')?

rubensworks commented 2 years ago

Thanks for the issue. This is a very interesting case, which I hadn't considered before.

So the problem here is that too many requests are being made for this query. The link traversal algorithm will do lookups for each separate unit document, even though all required information is actually already present in the initial source. So we need a mechanism to indicate this fact somehow.

Shapetrees may indeed be a possible solution for this (perhaps using some trickery with cardinalities in shapes), but I'm not sure. In any case, the current shapetrees implementation is incomplete, so it definitely can not be used as-is. I'll report here once I've made some progress on the shapetrees implementation, and when I think it might be helpful here.

In the meantime, content policies may also do the trick, as they should be able to indicate specifically which links can be followed. But this is also still very experimental.

rubensworks commented 1 year ago

This problem was also mentioned by @jeswr in #84 for the FOAF vocabulary.

jeswr commented 1 year ago

even though all required information is actually already present in the initial source. So we need a mechanism to indicate this fact somehow.

One way of doing this is to make use of rdfs:isDefinedBy. In particular, during link traversal, all incoming triples matching the pattern ?s rdfs:isDefinedBy ?o could be stored in a lookup table or in-memory store. Before a link is added to the traversal queue, we can then check whether it is in the isDefinedBy lookup table and whether the document it is defined by has already been dereferenced; if so, the link can be skipped.

This would indeed solve the QUDT case above, which has terms defined as follows:

<http://qudt.org/vocab/unit/AMD>
  a <http://qudt.org/schema/qudt/CurrencyUnit> ;
  a <http://qudt.org/schema/qudt/Unit> ;
  <http://purl.org/dc/terms/description> "Armenia"^^rdf:HTML ;
  <http://qudt.org/schema/qudt/currencyExponent> 0 ;
  <http://qudt.org/schema/qudt/dbpediaMatch> "http://dbpedia.org/resource/Armenian_dram"^^xsd:anyURI ;
  <http://qudt.org/schema/qudt/hasDimensionVector> <http://qudt.org/vocab/dimensionvector/A0E0L0I0M0H0T0D1> ;
  <http://qudt.org/schema/qudt/hasQuantityKind> <http://qudt.org/vocab/quantitykind/Currency> ;
  <http://qudt.org/schema/qudt/informativeReference> "http://en.wikipedia.org/wiki/Armenian_dram?oldid=492709723"^^xsd:anyURI ;
  rdfs:isDefinedBy <http://qudt.org/2.1/vocab/unit> ;
  rdfs:isDefinedBy <http://qudt.org/vocab/unit> ;
  rdfs:label "Armenian Dram"@en ;
.

Note that for this to work properly, the responseURL of each fetched link should also be added to the set of already dereferenced documents (though maybe this is the job of the HTTP cache?). Ideally one would also enable trackRedirects, if using a library with an API like follow-redirects, to further optimise this process.
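
To make the idea concrete, here is a minimal TypeScript sketch of such a lookup table. It is only an illustration under stated assumptions: the class DefinedByIndex and its methods recordDefinedBy, recordDereferenced and shouldTraverse are hypothetical names and are not part of Comunica's API, and how they would be wired into an actual link queue is left open.

```typescript
// Hypothetical helper, NOT part of Comunica: tracks rdfs:isDefinedBy
// statements and already dereferenced documents during traversal.
class DefinedByIndex {
  // term IRI -> documents that claim to define it
  private readonly definedBy = new Map<string, Set<string>>();
  // documents that have already been fetched (fragment stripped)
  private readonly dereferenced = new Set<string>();

  // Strip the fragment, since it is not part of the document URL.
  private static stripFragment(iri: string): string {
    const hash = iri.indexOf('#');
    return hash === -1 ? iri : iri.slice(0, hash);
  }

  // Call for every incoming triple matching ?s rdfs:isDefinedBy ?o.
  recordDefinedBy(term: string, document: string): void {
    let docs = this.definedBy.get(term);
    if (docs === undefined) {
      docs = new Set<string>();
      this.definedBy.set(term, docs);
    }
    docs.add(DefinedByIndex.stripFragment(document));
  }

  // Call after each fetch; responseUrl is the final URL after redirects.
  recordDereferenced(requestedUrl: string, responseUrl: string = requestedUrl): void {
    this.dereferenced.add(DefinedByIndex.stripFragment(requestedUrl));
    this.dereferenced.add(DefinedByIndex.stripFragment(responseUrl));
  }

  // Call before pushing a link onto the traversal queue.
  shouldTraverse(link: string): boolean {
    if (this.dereferenced.has(DefinedByIndex.stripFragment(link))) {
      return false; // this document was already fetched
    }
    const docs = this.definedBy.get(link);
    if (docs === undefined) {
      return true; // nothing known about this term, so follow the link
    }
    // Skip the link if any document that defines the term was already fetched.
    return !Array.from(docs).some((doc) => this.dereferenced.has(doc));
  }
}

const index = new DefinedByIndex();
index.recordDefinedBy('http://qudt.org/vocab/unit/AMD', 'http://qudt.org/vocab/unit');
console.log(index.shouldTraverse('http://qudt.org/vocab/unit/AMD')); // true
index.recordDereferenced('http://qudt.org/vocab/unit');
console.log(index.shouldTraverse('http://qudt.org/vocab/unit/AMD')); // false
```

The sketch compares URLs as exact strings, so a source given as http://qudt.org/vocab/unit/ (with a trailing slash) would not match an rdfs:isDefinedBy value of http://qudt.org/vocab/unit unless the response URL or some extra normalisation bridges the two, which is why recording the responseURL matters.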

cc @pmcb55