comunica / comunica-feature-link-traversal

📬 Comunica packages for link traversal-based query execution

Handling of the root node types of TREE documents and implementation of a loose mode for URL/subject inconsistencies in TREE documents #107

Closed constraintAutomaton closed 1 year ago

constraintAutomaton commented 1 year ago

Purpose

This PR solves two problems.

The first problem concerns how TREE documents are currently published on the web. Many documents, for example those listed at https://treecg.github.io/TREE-LDES-visualizer/ , use a subject that does not match the URL of the page. Traversing them requires assuming an implicit equivalence between the URL and the subject of the document, which the RDF specification does not guarantee. As a result, the current implementation ignores all the relations in such documents.

The second problem is that the current implementation does not handle the changes of subject that depend on the type of root node encountered. This causes the relations to be ignored in the same way as in the first problem.
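
To illustrate the second problem, the sketch below collects the subjects under which the relations of a page may be declared. It assumes the three properties that link a collection to a node (tree:view, void:subset and dcterms:isPartOf). The helper name effectiveSubjects and the plain-object quad shape are hypothetical and are not the actor's actual code.

```javascript
const TREE_VIEW = 'https://w3id.org/tree#view';
const VOID_SUBSET = 'http://rdfs.org/ns/void#subset';
const DCTERMS_IS_PART_OF = 'http://purl.org/dc/terms/isPartOf';

// Hypothetical helper: quads are plain { subject, predicate, object } objects holding IRIs.
function effectiveSubjects(quads, pageUrl) {
  // The page URL itself is always a candidate subject.
  const subjects = new Set([pageUrl]);
  for (const quad of quads) {
    // <collection> tree:view <node> or <collection> void:subset <node>: the node is the object.
    if (quad.predicate === TREE_VIEW || quad.predicate === VOID_SUBSET) {
      subjects.add(quad.object);
    }
    // <node> dcterms:isPartOf <collection>: the node is the subject.
    if (quad.predicate === DCTERMS_IS_PART_OF) {
      subjects.add(quad.subject);
    }
  }
  return subjects;
}
```

Relations attached to any of the returned subjects could then be treated as relations of the current page, instead of only those attached to the page URL.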

Resolution

This PR provides a new context flag, @comunica/actor-extract-links-tree:strictTraversal. When it is deactivated, the engine ignores the relationship between the URL and the subject of the page. The PR also makes the engine handle the changes of subject derived from the types of root nodes.
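
A minimal sketch of the difference between the strict and the loose mode, under simplifying assumptions: quads are plain objects holding IRIs or blank node labels, root node types are left out, and the helper name linksToFollow is hypothetical rather than taken from the actor.

```javascript
const TREE_RELATION = 'https://w3id.org/tree#relation';
const TREE_NODE = 'https://w3id.org/tree#node';

// Hypothetical helper, not the actor's real code.
function linksToFollow(quads, pageUrl, strictTraversal) {
  // Keep the relations whose declaring subject is accepted by the current mode.
  const relations = new Set();
  for (const quad of quads) {
    const subjectAccepted = !strictTraversal || quad.subject === pageUrl;
    if (quad.predicate === TREE_RELATION && subjectAccepted) {
      relations.add(quad.object);
    }
  }
  // The links are the tree:node targets of the kept relations.
  return quads
    .filter((quad) => quad.predicate === TREE_NODE && relations.has(quad.subject))
    .map((quad) => quad.object);
}

// A page served at one URL whose relations are declared on a different subject.
const page = [
  { subject: 'http://example.org/other', predicate: TREE_RELATION, object: '_:r1' },
  { subject: '_:r1', predicate: TREE_NODE, object: 'http://example.org/page2' },
];
const strictLinks = linksToFollow(page, 'http://example.org/page1', true);
const looseLinks = linksToFollow(page, 'http://example.org/page1', false);
```

With this input, strictLinks is empty because the relation is not declared on the page URL, while looseLinks contains http://example.org/page2.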

E2E test

The script below was used for end-to-end testing of the changes. The context flag KeysExtractLinksTree.strictTraversal can be toggled to validate its effect on the traversal, and other data sources can be used for further validation.

// Link-traversal query engine, pretty logger, and the context keys of the TREE link extractor.
const comunica = require("@comunica/query-sparql-link-traversal");
const log = require("@comunica/logger-pretty");
const KeysExtractLinksTree = require('@comunica/context-entries-link-traversal').KeysExtractLinksTree;

new comunica.QueryEngineFactory().create({ configPath: './engines/config-query-sparql-link-traversal/config/config-tree.json' }).then(
  (engine) => {
    // Return the promise so that query failures reach the final catch.
    return engine.queryBindings(`
      SELECT * WHERE {
        ?s <https://w3id.org/tree#node> ?o
      }`, {
      sources: ['https://treecg.github.io/demo_data/cht.ttl'],
      // Switch to true to compare against strict URL/subject matching.
      [KeysExtractLinksTree.strictTraversal.name]: false,
      lenient: true,
      log: new log.LoggerPretty({ level: 'trace' }),
    }).then((bindingsStream) => {
      bindingsStream.on('data', (binding) => {
        console.log(binding.toString());
      });
      bindingsStream.on('error', console.error);
    });
  }
).catch(console.error);

close #89

constraintAutomaton commented 1 year ago

Ah, why can't everyone just follow the spec 😅

In any case, such a mode is useful indeed. The only thing I'm wondering though is whether we really want it as an actor parameter, or whether a context option might be preferable here?

I think placing it in the context is better indeed, as it is easier to configure for the user IMO.

constraintAutomaton commented 1 year ago

Looks great!

Just some minor nits regarding the context key.

Another thing I just noticed. Apparently the actor is named actor-extract-links-extract-tree. Shall we rename it to actor-extract-links-tree? No need to have extract in there twice. (Not super urgent; this can be done later in a separate PR. It will probably mess up your other PR though, so maybe after that one is merged.)

Good point on the naming... I will make another PR after that to fix it.