comunica / comunica-feature-link-traversal

📬 Comunica packages for link traversal-based query execution

Solid-default link traversal takes exponential time to execute SPARQL #60

Closed phochste closed 3 months ago

phochste commented 2 years ago

Issue type:


Description:

The solid-default configuration of the Comunica Link Traversal client takes exponential time to execute queries on an LDP container with many resources.

In my example Pod I have LDN inboxes that contain hundreds to thousands of JSON-LD resources. Every JSON-LD resource has the same structure: an object key whose value holds a subject, a predicate (the relationship key) and an object key. E.g.

  ...
  "object": {
    "id": "e179deef-c575-45ba-8fcf-2b2fa6809311",
    "type": "Relationship",
    "relationship": "http://www.scholix.org/References",
    "subject": "https://doi.org/10.3390/en10111697",
    "object": "https://data.mendeley.com/datasets/mcgc3636xr"
  },
 ...

I would like to have a list of all such keys over all resources in an LDP container. The query I use is:

PREFIX as: <https://www.w3.org/ns/activitystreams#>

SELECT DISTINCT ?subject ?pred ?object
WHERE {
  ?id a as:Announce ;
      as:object ?x .
  ?x as:relationship ?pred ;
     as:subject ?subject ;
     as:object ?object .
}

I have 4 example LDP containers:

Executing the SPARQL query on resource 209 takes 17.2 seconds and returns 209 results.

Executing the SPARQL query on resource 402 is still running after 1500 seconds (170 results so far). The first results appeared after 250 seconds.


Environment:

Comunica version: 2.1.0

On Chrome 100.0.4896.127

Crash log:

github-actions[bot] commented 2 years ago

Thanks for reporting!

phochste commented 2 years ago

On the 402 resource I see up to 90 s of network activity in the console (805 requests). After 90 s there is no new network activity, and there are no errors in the console. In a previous experiment I saw a setTimeout error pop up.

rubensworks commented 2 years ago

I'm not surprised about this :-) I expect the execution time to increase, the more triple patterns occur within your query.

This is (most likely) due to the zero-knowledge query planner that we're using for link traversal, which produces non-optimal query plans. (it's the only thing that exists for traversal atm, so it's the best we got) Ideally, we'd need an adaptive query planner that re-orders join entries based on whatever comes in as intermediary results. Related to #45 and #48.
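To make the join-ordering point concrete, here is a minimal sketch of how the order of triple patterns changes the number of intermediate results. It is a toy in-memory nested-loop join, not Comunica's planner; all names (`makeData`, `evaluate`, `QUERY_ORDER`, `BAD_ORDER`) and the synthetic data are assumptions made up for illustration, and the cost of the HTTP requests that link traversal adds is not modelled at all.

```typescript
// Toy model only: hypothetical names, not part of Comunica.
type Triple = { s: string; p: string; o: string };
type Pattern = { s: string; p: string; o: string }; // terms starting with "?" are variables
type Binding = Record<string, string>;

const isVar = (term: string): boolean => term.startsWith("?");

// Extend `binding` so that `pattern` matches `triple`; return null on mismatch.
function match(pattern: Pattern, triple: Triple, binding: Binding): Binding | null {
  const result: Binding = { ...binding };
  for (const pos of ["s", "p", "o"] as const) {
    const term = pattern[pos];
    const value = triple[pos];
    if (isVar(term)) {
      if (term in result && result[term] !== value) return null;
      result[term] = value;
    } else if (term !== value) {
      return null;
    }
  }
  return result;
}

// Left-deep nested-loop join in the given order.
// `intermediate[i]` is the number of bindings alive after joining pattern i.
function evaluate(patterns: Pattern[], data: Triple[]): { results: Binding[]; intermediate: number[] } {
  let bindings: Binding[] = [{}];
  const intermediate: number[] = [];
  for (const pattern of patterns) {
    const next: Binding[] = [];
    for (const binding of bindings) {
      for (const triple of data) {
        const extended = match(pattern, triple, binding);
        if (extended) next.push(extended);
      }
    }
    bindings = next;
    intermediate.push(next.length);
  }
  return { results: bindings, intermediate };
}

// Synthetic data shaped like the notifications in this issue: 5 triples per notification.
function makeData(n: number): Triple[] {
  const data: Triple[] = [];
  for (let i = 0; i < n; i++) {
    data.push({ s: `n${i}`, p: "rdf:type", o: "as:Announce" });
    data.push({ s: `n${i}`, p: "as:object", o: `x${i}` });
    data.push({ s: `x${i}`, p: "as:relationship", o: "scholix:References" });
    data.push({ s: `x${i}`, p: "as:subject", o: `doi${i}` });
    data.push({ s: `x${i}`, p: "as:object", o: `dataset${i}` });
  }
  return data;
}

const P_TYPE: Pattern = { s: "?id", p: "rdf:type", o: "as:Announce" };
const P_OBJ: Pattern = { s: "?id", p: "as:object", o: "?x" };
const P_REL: Pattern = { s: "?x", p: "as:relationship", o: "?pred" };
const P_SUBJ: Pattern = { s: "?x", p: "as:subject", o: "?subject" };
const P_TARGET: Pattern = { s: "?x", p: "as:object", o: "?object" };

// Order as written in the query: every step shares a variable with the previous ones.
const QUERY_ORDER = [P_TYPE, P_OBJ, P_REL, P_SUBJ, P_TARGET];
// An order where the second pattern shares no variable with the first (cartesian product).
const BAD_ORDER = [P_SUBJ, P_TYPE, P_OBJ, P_REL, P_TARGET];

const data = makeData(20);
console.log(evaluate(QUERY_ORDER, data).intermediate); // [ 20, 20, 20, 20, 20 ]
console.log(evaluate(BAD_ORDER, data).intermediate);   // [ 20, 400, 20, 20, 20 ]
```

Both orders return the same 20 results, but the second one materialises n² bindings in the middle. A planner with no cardinality information cannot tell these orders apart up front, which is why re-ordering based on incoming intermediate results would help.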

rubensworks commented 1 year ago

@phochste Could you make those containers public (so I can test), or test yourself again to see if the issue is any better (I don't expect it to be fully resolved yet, but probably better).

phochste commented 1 year ago

@rubensworks the demo repositories above have been made world readable again

rubensworks commented 1 year ago

I don't seem to be getting any results anymore. Perhaps the underlying data changed?

phochste commented 1 year ago

@rubensworks No, the underlying data did not change. But I see there is an error in the JSON-LD data. I had

  "actor" : {
      "type" : "OPENAIRE",
      "id" : "https://scholexplorer.openaire.eu/#about",
      "name" : "OPENAIRE Scholexplorer"
   }

This $.actor.type is not valid, and for some reason somewhere in the pipeline between Solid CSS and Comunica it generates an illegal triple.

I'm changing the demo documents right now to create a valid $.actor.type.
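A quick way to spot values like this before uploading documents could look like the sketch below. It assumes the problem is that the type value is neither a term defined by the JSON-LD context nor an IRI; that is a guess, not a confirmed diagnosis. `KNOWN_TYPES` is a hand-picked, incomplete subset of Activity Streams 2.0 type names, and `findSuspiciousTypes` is a hypothetical helper, not part of Solid CSS, Comunica, or any JSON-LD library. A real check would expand the document with a JSON-LD processor instead.

```typescript
// Hypothetical pre-upload check; the term list below is an assumed, incomplete subset.
const KNOWN_TYPES = new Set([
  "Announce", "Relationship", "Application", "Group", "Organization", "Person", "Service",
]);

// True for strings of the form "scheme:rest" (this also lets compact IRIs such as "as:Announce" through).
function isAbsoluteIri(value: string): boolean {
  return /^[A-Za-z][A-Za-z0-9+.-]*:\S+$/.test(value);
}

type Issue = { path: string; value: string };

// Walk a parsed JSON document and report every "type"/"@type" string
// that is neither a known term nor IRI-shaped.
function findSuspiciousTypes(node: unknown, path: string = "$"): Issue[] {
  const issues: Issue[] = [];
  if (Array.isArray(node)) {
    node.forEach((item, i) => issues.push(...findSuspiciousTypes(item, `${path}[${i}]`)));
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      const childPath = `${path}.${key}`;
      if (key === "type" || key === "@type") {
        const values = Array.isArray(value) ? value : [value];
        for (const v of values) {
          if (typeof v === "string" && !KNOWN_TYPES.has(v) && !isAbsoluteIri(v)) {
            issues.push({ path: childPath, value: v });
          }
        }
      } else {
        issues.push(...findSuspiciousTypes(value, childPath));
      }
    }
  }
  return issues;
}

// The actor block from this comment gets flagged at $.actor.type.
const notification = {
  actor: {
    type: "OPENAIRE",
    id: "https://scholexplorer.openaire.eu/#about",
    name: "OPENAIRE Scholexplorer",
  },
  object: { type: "Relationship", relationship: "http://www.scholix.org/References" },
};
console.log(findSuspiciousTypes(notification));
```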

rubensworks commented 3 months ago

I'm going to close this issue due to non-reproducibility. Happy to re-open if needed.