[Feature Request] Investigate whether the Service Description of SPARQL endpoints provides useful statistics

JervenBolleman commented 1 month ago

Description

For large public SPARQL endpoints such as UniProt, statistics gathering takes a long time. UniProt and other endpoints provide detailed statistics with VoID, which might be useful to avoid sending out a lot of statistics-gathering queries.

Preferred Solution

Gather as much data as possible from the VoID response when one is available.

Additional Context

The discovery queries could be rewritten on the server side to retrieve information from the /.well-known/void graph instead.

Related Issues

Tasks

[] implement a VoID/service description parser
[] determine whether the VoID data contains enough information for aws graph-explorer
[] document the behavior and trust level of the VoID data
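
To make the trust-level task concrete, here is a minimal sketch of a "VoID first, live queries as fallback" strategy. It is written in Python only for illustration; `gather_counts`, `fetch_void` and `fetch_live` are hypothetical names, not part of graph-explorer, and the two callables stand in for whatever actually sends the queries.

```python
from typing import Callable, Optional

Counts = dict[str, int]


def gather_counts(
    fetch_void: Callable[[], Optional[Counts]],
    fetch_live: Callable[[], Counts],
) -> tuple[str, Counts]:
    """Return (source, counts), where source records which path produced the data."""
    try:
        void_counts = fetch_void()
    except Exception:
        # Assumption: any failure of the VoID lookup just means "no VoID available".
        void_counts = None
    if void_counts:
        # Published statistics may be stale, so the caller is told where they came from.
        return "void", void_counts
    # No usable VoID data: fall back to the existing statistics-gathering queries.
    return "live", fetch_live()


# Usage with stand-in functions instead of real endpoint calls.
source, counts = gather_counts(lambda: {"http://example.org/p": 3},
                               lambda: {"http://example.org/p": 5})
print(source, counts)
```

Returning the source next to the counts is one possible way to let the UI or the docs state how much the numbers can be trusted.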

[!IMPORTANT] If you are interested in working on this issue or have submitted a pull request, please leave a comment.

[!TIP] Please use a 👍 reaction to provide a +1/vote.

This helps the community and maintainers prioritize this request.

JervenBolleman commented 1 month ago

I am interested in contributing this feature.

JervenBolleman commented 1 month ago

For example, the starting query

SELECT ?predicate (COUNT(?predicate) as ?count) { [] ?predicate ?object FILTER(!isLiteral(?object))} GROUP BY ?predicate

times out at the UniProt SPARQL endpoint. By contrast, this query over the VoID data:

PREFIX void:<http://rdfs.org/ns/void#>
PREFIX void_ext:<http://ldf.fi/void-ext#>
SELECT
  ?predicate (SUM(?perPredicatePartitionCount) AS ?count)
{
  ?predicatePartition void:property ?predicate ;
                      void:triples ?perPredicatePartitionCount .
  MINUS {
    ?predicatePartition void_ext:datatypePartition ?datatype .
  }
} GROUP BY ?predicate

It gives the same general results.
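
As a rough sketch of the parser side, the snippet below turns a result document in the standard SPARQL 1.1 JSON results format into a predicate-to-count map. It assumes the variables are named `?predicate` and `?count` as in the queries above; `predicate_counts` is a hypothetical helper name and the sample data is hand-written, not real UniProt output.

```python
from typing import Any


def predicate_counts(results: dict[str, Any]) -> dict[str, int]:
    """Collect ?predicate -> ?count from a SPARQL JSON results document."""
    counts: dict[str, int] = {}
    for row in results.get("results", {}).get("bindings", []):
        predicate = row.get("predicate", {}).get("value")
        raw_count = row.get("count", {}).get("value")
        if predicate is None or raw_count is None:
            continue  # skip rows where a variable is unbound
        try:
            count = int(raw_count)
        except ValueError:
            continue  # skip counts that are not plain integers
        # Sum in case the same predicate shows up in more than one row.
        counts[predicate] = counts.get(predicate, 0) + count
    return counts


# Hand-written sample in the SPARQL 1.1 JSON results shape.
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
sample = {
    "head": {"vars": ["predicate", "count"]},
    "results": {
        "bindings": [
            {"predicate": {"type": "uri", "value": "http://example.org/p"},
             "count": {"type": "literal", "datatype": XSD_INTEGER, "value": "42"}},
            {"predicate": {"type": "uri", "value": "http://example.org/q"},
             "count": {"type": "literal", "datatype": XSD_INTEGER, "value": "7"}},
            {"predicate": {"type": "uri", "value": "http://example.org/r"},
             "count": {"type": "literal", "value": "not-a-number"}},
        ]
    },
}
print(predicate_counts(sample))
```

Because both the live query and the VoID query bind the same two variables, the same helper could in principle consume either result.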

kmcginnes commented 1 month ago

Interesting approach. I like the idea.

The first question that pops into my mind is how universal this is. Can this request be used across all SPARQL endpoints?

There are certainly issues with the schema sync process that can cause timeouts. We are looking into those. I'm going to add this approach as one of the things we try.

Feel free to create a PR for it. We love submissions 🤓
