Test the distributed meta-index before

lecoqlibre commented 2 months ago

We want to have a meta index on each instance to try to have faster distributed indexes querying.

lecoqlibre commented 2 months ago

We could have something like the following. It would tell if an instance have some users for some skills or cities. But this won't tell if these users are the same. If the different targeted instances have users for the same skills and cities this would likely be useless as we would still query the whole distributed federation.

The index, for the instance https://instance1.example.org could look like:

? a ex:SourceSelectionIndex.

:php a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "php";
    ex:instancesIn <path/to/php/index>;
    ex:hasSource <https://instance1.example.org>.

:css a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "css";
    ex:instancesIn <path/to/css/index>;
    ex:hasSource <https://instance1.example.org>.

:javascript a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "javascript";
    ex:instancesIn <path/to/javascript/index>;
    ex:hasSource <https://instance1.example.org>.

:paris a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasLocation;
    ex:forValue "paris";
    ex:instancesIn <path/to/paris/index>
    ex:hasSource <https://instance1.example.org>.

The SPARQL query to find instances (and their indexes) having users with some skills or cities:

SELECT ?source, ?indexPhp, ?indexCss, ?indexJavascript, ?indexParis WHERE {
    ?hasPhp a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasSkill;
        ex:forValue "php";
        ex:instancesIn ?indexPhp;
        ex:hasSource ?source.

    ?hasCss a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasSkill;
        ex:forValue "css";
        ex:instancesIn ?indexCss;
        ex:hasSource ?source.

    ?hasJavascript a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasSkill;
        ex:forValue "javascript";
        ex:instancesIn ?indexJavascript;
        ex:hasSource ?source.

    ?hasCity1 a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasLocation;
        ex:forValue "paris";
        ex:instancesIn ?indexParis;
        ex:hasSource ?source.
}

pchampin commented 2 months ago

what's the added value of ex:hasSource? Given that the value of ex:instancesIn is an IRI (I assume!), it will also contain the source...

lecoqlibre commented 2 months ago

We discussed about it today. The ex:hasSource allows to do a join to get the results from the same instance. We can avoid it by rewriting the query to make a join based on the extraction of the source from the ex:instancesIn. So we have to choose between a more complex SPARQL query or a more verbose index.

lecoqlibre commented 1 month ago

[x] I tried the distributed meta indexes with the link traversal (https://github.com/Wimmics/solid-start/commit/ebfcedfe4e37a8b4c68804d844dbcc6d0d54aecb) and the perfs are very bad. It's completely inusable in term of elapsed time and consumed resources (CPU and memory).

I think the link traversal plays a huge role in these bad perfs especially because it dereferences every indexes on every instance (the meta indexes contain a link to all these indexes and Comunica assumes it can find some interesting triples in these too)!

So we should really use named graphs or at least split our query into several queries (to simulate named graphs waiting for Comunica to be fixed - by us).

[x] Re-try with a simulation of named graphs by executing several queries sequentially.

@FabienGandon @pchampin @balessan

balessan commented 1 month ago

Thanks @lecoqlibre for the tests, not a good news I would say !

So we should really use named graphs or at least split our query into several queries (to simulate named graphs waiting for Comunica to be fixed - by us).

Any ideas of the steps to achieve that ?

pchampin commented 1 month ago

Any ideas of the steps to achieve that ?

My suggestion to @lecoqlibre (in a mattermost discussion) is to split the query into a list of queries, where each query (apart from the last one) provides the sources against which the next query will be executed.

lecoqlibre commented 1 month ago

In order to split our query into a list of queries I first executed a query to try source selection: to know which instances have the selected skills and cities. This gives bad perfs and we are not using the link traversal. Indeed, for a single skill and a single city it took 25 sec. to complete the execution. For 3 skills and one city it was about 90 sec. (see below).

Here is the SPARQL query:

SELECT DISTINCT ?source WHERE {
  ?23 a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasSkill>;
    <http://example.org#forValue> "23";
    <http://example.org#hasSource> ?source.

  ?24 a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasSkill>;
    <http://example.org#forValue> "24";
    <http://example.org#hasSource> ?source.

  ?25 a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasSkill>;
    <http://example.org#forValue> "25";
    <http://example.org#hasSource> ?source.

  ?paris a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasLocation>;
    <http://example.org#forValue> "paris";
    <http://example.org#hasSource> ?source.
} LIMIT 100

The results of this query are:

The targeted sources are the distributed meta indexes:

http://localhost:8001/indexes/root
http://localhost:8002/indexes/root
http://localhost:8003/indexes/root
http://localhost:8004/indexes/root
http://localhost:8005/indexes/root
http://localhost:8006/indexes/root
http://localhost:8007/indexes/root
http://localhost:8008/indexes/root
http://localhost:8009/indexes/root
http://localhost:8010/indexes/root
http://localhost:8011/indexes/root
http://localhost:8012/indexes/root
http://localhost:8013/indexes/root
http://localhost:8014/indexes/root
http://localhost:8015/indexes/root
http://localhost:8016/indexes/root
http://localhost:8017/indexes/root
http://localhost:8018/indexes/root
http://localhost:8019/indexes/root
http://localhost:8020/indexes/root
http://localhost:8021/indexes/root
http://localhost:8022/indexes/root
http://localhost:8023/indexes/root
http://localhost:8024/indexes/root
http://localhost:8025/indexes/root
http://localhost:8026/indexes/root
http://localhost:8027/indexes/root
http://localhost:8028/indexes/root
http://localhost:8029/indexes/root
http://localhost:8030/indexes/root
http://localhost:8031/indexes/root
http://localhost:8032/indexes/root

lecoqlibre commented 1 month ago

So I think there is already a blocker at the level of source selection.

One hybrid approach (distributed/federated) could be to have a federated index to do source selection. This index would allow us to know which instances could potentially have valid results (for example which instances have users for the skill 23, 24 and 25 and for Paris).

Here is an example of what that federated index could look like:

@prefix : <#>.
@prefix ex: <http://example.org#>.

<> a ex:SourceSelectionIndex.

:23 a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "23";
    ex:instancesIn <http://localhost:8001>, <http://localhost:8003>, <http://localhost:8013>, <http://localhost:8022>.

We could even have something more detailed with some stats like the number of matched users:

@prefix : <#>.
@prefix ex: <http://example.org#>.

<> a ex:SourceSelectionIndex.

:23 a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "23";
    ex:results 
        [
            ex:instancesIn <http://localhost:8001>;
            ex:hasCount "8"
        ],
        [
            ex:instancesIn <http://localhost:8003>;
            ex:hasCount "43"
        ],
        [
            ex:instancesIn <http://localhost:8013>;
            ex:hasCount "12"
        ],
        [
            ex:instancesIn <http://localhost:8022>;
            ex:hasCount "1"
        ].

Note: I used predicates which might be changed.

@FabienGandon @pchampin @balessan

lecoqlibre commented 1 month ago

I fixed the graphical interface so now I get a new instance every 2 seconds (1 skill + 1 city) and 7 seconds (3 skills + 1 city) in the results. We should not wait for the end of a previous query to execute the next one. With streams we can start the distributed exploration as soon as we get a result. So as soon as I get a link to a selected instance from the source selection query I can launch a query on this instance to get the first users.

I will test this but I suppose that the idea of having a federated source selection index could still improve the perfs.

lecoqlibre commented 1 month ago

I succeeded (https://github.com/Wimmics/solid-start/commit/ae6b5b52a311072a99f17ae68d868d7f3c7626c5) in testing the source selection without link traversal starting from the 32 distributed meta indexes (one on each instance):

one query is used to select the relevant indexes on the distributed instances;
a second query is executed on each relevant distributed instances (using the previously selected indexes as sources).

Note: here we are testing in a distributed fashion only. The federated index to do source selection has not been tested.

I tried on 1 skill (23) + 1 city (Paris). The total execution time was about 35 sec. I got the first result at 22 sec. and the second result at 34 sec. This can be explained by the fact that the two results are coming from the 7th and the 11th instance on the list (selected instances are 1, 8, 10, 12, 14, 16, 23, 25, 27, 29, 31). The list is currently sorted by instance number.

To get the results faster we should have some mechanism to sort the instances by relevance. This way, the most relevant instances will be queried first and the results should come faster.

Also using a federated index to do source selection like proposed above can help to be faster. We can even link to specific distributed indexes directly from there to save a traversal.

Any thoughts @FabienGandon @pchampin @balessan ?

pchampin commented 1 month ago

Are you waiting for all the results of one instance before starting to query the next one?

If that's what you are doing, I believe that you could get much better performances by running the queries in parallel, and showing results as they arrive, regardless of the source.

This can be achievable by storing multiple promises in an array, and using Promise.any to get the result from the first one to succeeds, and repeating this until none of them has anything on stock.

lecoqlibre commented 1 month ago

Are you waiting for all the results of one instance before starting to query the next one?

No I'm (normally) not. Like I said, as soon as I get a source I can execute a query on this instance. The bindingsStream.on('data', callback) of the used strategy executes a new query as soon as a source is found. So several queries are normally running in parallel. You can look at the previously mentioned commit https://github.com/Wimmics/solid-start/commit/ae6b5b52a311072a99f17ae68d868d7f3c7626c5, especially the line 56 of the StrategyComunicaSourceSelection class.

This can be achievable by storing multiple promises in an array

I'm indeed using an array of promises and wait for all to resolve to end the global query using Promise.all, see line 61.

pchampin commented 1 month ago

It does indeed look like you are running all "subqueries" in parallel, which is good. Improvement needs to come from somewhere else then :) Taking advantage of the hasCount, as suggested above, looks promising.

Wimmics / solid-start

Test the distributed meta-index before #38