MaastrichtU-IDS / federatedQueryKG

This repository is a workspace for the COST Action Hackathon on Federated Query over Knowledge Graphs, to be held on 25-27 April in Izmir, Turkey.
MIT License

HeFQUIN issue with calling mapping service #10

Closed micheldumontier closed 1 year ago

micheldumontier commented 1 year ago

Calling the mapping service (i.e., https://bioregistry.io/sparql) returns an empty result. Note that this custom SPARQL endpoint currently does not respond to POST requests; that issue is registered here: https://github.com/biopragmatics/bioregistry/issues/802
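To separate the POST problem from the empty-result problem, it may help to send the same query to the endpoint once via GET and once via URL-encoded POST, the two request styles described in the SPARQL 1.1 Protocol. The following is a minimal sketch using only the Python standard library; the helper names (`build_get`, `build_post`, `probe`) are made up for illustration and are not part of HeFQUIN or Bioregistry.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ACCEPT = "application/sparql-results+json"


def build_get(endpoint: str, query: str) -> Request:
    """'Query via GET': the query travels in the URL query string."""
    return Request(
        f"{endpoint}?{urlencode({'query': query})}",
        headers={"Accept": ACCEPT},
        method="GET",
    )


def build_post(endpoint: str, query: str) -> Request:
    """'Query via URL-encoded POST': the query travels in the form body."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Accept": ACCEPT,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def probe(request: Request, timeout: float = 30.0) -> str:
    """Send the request and summarize the outcome (needs network access)."""
    method = request.get_method()
    try:
        with urlopen(request, timeout=timeout) as response:
            return f"{method} -> HTTP {response.status}"
    except HTTPError as err:
        # The server answered, but with an error status (e.g. 405 for POST).
        return f"{method} -> HTTP {err.code}"
    except OSError as err:
        # Connection problems, timeouts, DNS failures, and similar.
        return f"{method} -> failed: {err}"
```

For example, calling `probe(build_get(endpoint, query))` and `probe(build_post(endpoint, query))` with `endpoint = "https://bioregistry.io/sparql"` and a small `owl:sameAs` query would show whether the two request styles are treated differently. No particular outcome is assumed here.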

micheldumontier@FSELAP0508:~/code/external/HeFQUIN$ java -cp target/HeFQUIN-0.0.1-SNAPSHOT.jar se.liu.ida.hefquin.cli.RunQueryWithoutSrcSel --federationDescription=ExampleFederation.ttl --query=bioregistry-federation.rq --printLogicalPlan --printPhysicalPlan
> mj 
  > req[-53240192, 1714180207] ( { (filter (= ?geneLabel "AP2B1")
  (bgp
    (triple ?gene <https://w3id.org/biolink/vocab/category> <https://w3id.org/biolink/vocab/Gene>)
    (triple ?gene <http://www.w3.org/2000/01/rdf-schema#label> ?geneLabel)
  ))
 }, SPARQL endpoint at http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql )
  > req[-513956468, -992590175] ( { (bgp (triple ?gene @owl:sameAs ?geneHttps)  ) }, SPARQL endpoint at https://bioregistry.io/sparql )

> FILTERBindJoin> bgpAdd[-513956468, -992590175] ( (bgp (triple ?gene @owl:sameAs ?geneHttps)  ), SPARQL endpoint at https://bioregistry.io/sparql )
  > req[-53240192, 1714180207] ( { (filter (= ?geneLabel "AP2B1")
  (bgp
    (triple ?gene <https://w3id.org/biolink/vocab/category> <https://w3id.org/biolink/vocab/Gene>)
    (triple ?gene <http://www.w3.org/2000/01/rdf-schema#label> ?geneLabel)
  ))
 }, SPARQL endpoint at http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql )

--------------------------------
| gene | geneLabel | geneHttps |
================================
--------------------------------

Query (bioregistry-federation.rq):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX wp:      <http://vocabularies.wikipathways.org/wp#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dct: <http://purl.org/dc/terms/>
#! endpoint: https://sparql.wikipathways.org/sparql
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * #DISTINCT ?pathway ?pathwayLabel #?gene ?geneLabel 

{
    SERVICE <http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql> {
        ?gene bl:category bl:Gene ;
        rdfs:label ?geneLabel .
        #FILTER(?gene = <http://identifiers.org/ensembl/ENSG00000100030>)

        FILTER (?geneLabel = "AP2B1")
    } 
    SERVICE <https://bioregistry.io/sparql> {
        ?gene owl:sameAs ?geneHttps .
        #BIND(uri(replace(str(?gene), "http://identifiers.org/", "https://identifiers.org/")) as ?geneHttps)
    }
#   SERVICE <https://sparql.wikipathways.org/sparql> {
#       ?geneHttps dct:isPartOf ?pathway .
#       ?pathway dc:title ?pathwayLabel .
#   }
}

Federation description (ExampleFederation.ttl):

PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX fd:     <http://www.example.org/se/liu/ida/hefquin/fd#>
PREFIX ex:     <http://example.org/>

ex:bio2rdfSPARQL
      a fd:FederationMember ; fd:interface [ a fd:SPARQLEndpointInterface ; fd:endpointAddress <http://bio2rdf.org/sparql> ] .

ex:bioregistry
      a fd:FederationMember ; fd:interface [ a fd:SPARQLEndpointInterface ; fd:endpointAddress <https://bioregistry.io/sparql> ] .

ex:berkeleybop
      a fd:FederationMember ; fd:interface [ a fd:SPARQLEndpointInterface ; fd:endpointAddress <http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql> ] .

ex:wikipathways
      a fd:FederationMember ; fd:interface [ a fd:SPARQLEndpointInterface ; fd:endpointAddress <https://sparql.wikipathways.org/sparql> ] .
hartig commented 1 year ago

@micheldumontier While investigating this issue I came across a weird entry in the Bioregistry database.

When executing the following query with HeFQUIN ...

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT * WHERE {
  SERVICE <https://bioregistry.io/sparql> {
    <http://identifiers.org/ensembl/ENSG00000006125> owl:sameAs ?o
  }
}

... HeFQUIN throws the following exception.

<http://bacteria.ensembl.org/[?species_name]/Gene/Summary?g=ENSG00000006125> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
org.apache.jena.irix.IRIException: <http://bacteria.ensembl.org/[?species_name]/Gene/Summary?g=ENSG00000006125> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
        at org.apache.jena.irix.IRIProviderJenaIRI.exceptions(IRIProviderJenaIRI.java:256)
        at org.apache.jena.irix.IRIProviderJenaIRI.newIRIxJena(IRIProviderJenaIRI.java:137)
        at org.apache.jena.irix.IRIProviderJenaIRI.create(IRIProviderJenaIRI.java:145)
        at org.apache.jena.irix.IRIx.create(IRIx.java:54)
        at org.apache.jena.sparql.util.FmtUtils.abbrevByBase(FmtUtils.java:475)
        at org.apache.jena.sparql.util.FmtUtils.stringForURI(FmtUtils.java:460)
        at org.apache.jena.sparql.util.FmtUtils.stringForURI(FmtUtils.java:433)
        at org.apache.jena.sparql.util.FmtUtils.stringForNode(FmtUtils.java:373)
        at org.apache.jena.sparql.util.FmtUtils.stringForNode(FmtUtils.java:347)
        at org.apache.jena.sparql.util.FmtUtils.stringForRDFNode(FmtUtils.java:185)
        at org.apache.jena.riot.resultset.rw.ResultSetWriterText.getVarValueAsString(ResultSetWriterText.java:201)
        at org.apache.jena.riot.resultset.rw.ResultSetWriterText.colWidths(ResultSetWriterText.java:99)
        at org.apache.jena.riot.resultset.rw.ResultSetWriterText.output$(ResultSetWriterText.java:135)
        at org.apache.jena.riot.resultset.rw.ResultSetWriterText.output(ResultSetWriterText.java:120)
        at org.apache.jena.riot.resultset.rw.ResultSetWriterText.output(ResultSetWriterText.java:116)
        at org.apache.jena.riot.resultset.rw.ResultSetWriterText.write(ResultSetWriterText.java:59)
        at org.apache.jena.riot.resultset.rw.ResultsWriter.write(ResultsWriter.java:156)
        at org.apache.jena.riot.resultset.rw.ResultsWriter.write(ResultsWriter.java:126)
        at org.apache.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:133)
        at org.apache.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:150)
        at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:81)
        at se.liu.ida.hefquin.engine.HeFQUINEngineBuilder$MyEngine.executeQuery(HeFQUINEngineBuilder.java:169)
        at se.liu.ida.hefquin.cli.RunQueryWithoutSrcSel.exec(RunQueryWithoutSrcSel.java:105)
        at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
        at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
        at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
        at se.liu.ida.hefquin.cli.RunQueryWithoutSrcSel.main(RunQueryWithoutSrcSel.java:49)

I will work on making HeFQUIN more robust (i.e., such that it does not simply die in such cases). However, the error is actually valid. That is, the illegal IRI is indeed returned by the Bioregistry SPARQL endpoint. You can check this by going to https://bioregistry.io/sparql and running the following query; you will see that several IRIs of this invalid form appear in the result.

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?o WHERE {
    <http://identifiers.org/ensembl/ENSG00000006125> owl:sameAs ?o
}
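
To spot such entries on the client side, a rough check for characters that are not allowed after the authority part of an IRI can be sketched as follows. This is only an approximation of the RFC 3986/3987 rules and is not the check that Jena's IRIx performs; the function name `illegal_iri_chars` is hypothetical.

```python
import re

# Scheme plus optional "//authority" prefix; brackets are legal inside the
# authority (IPv6 literals), so that part is skipped before checking.
_SCHEME_AND_AUTHORITY = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:(?://[^/?#]*)?")

# Characters that may not appear unescaped in the path, query, or fragment:
# square brackets, the characters excluded by SPARQL's IRIREF rule, and
# whitespace/control characters.
_ILLEGAL = re.compile(r'[\[\]<>"{}|\\^`\x00-\x20]')


def illegal_iri_chars(iri: str) -> list[str]:
    """Return the distinct illegal characters found after the authority."""
    rest = _SCHEME_AND_AUTHORITY.sub("", iri, count=1)
    return sorted(set(_ILLEGAL.findall(rest)))


bad = "http://bacteria.ensembl.org/[?species_name]/Gene/Summary?g=ENSG00000006125"
good = "https://identifiers.org/ensembl/ENSG00000006125"

print(illegal_iri_chars(bad))   # the square brackets are reported
print(illegal_iri_chars(good))  # nothing is reported
```

A result set could be filtered with this before printing, for example by keeping only the IRIs for which the returned list is empty.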
hartig commented 1 year ago

I have created an issue about the invalid IRIs in the Bioregistry repo: https://github.com/biopragmatics/bioregistry/issues/803

hartig commented 1 year ago

The reason for the empty result is that the Bioregistry SPARQL endpoint does not support FILTER clauses in queries and HeFQUIN uses the FILTER-based variation as its default implementation of the bind join algorithm. I have filed a corresponding issue in the Bioregistry repo: https://github.com/biopragmatics/bioregistry/issues/804

We also have a VALUES-based implementation and a UNION-based implementation of bind join in HeFQUIN. By using the VALUES-based implementation, the federation query (in the first comment above) works and produces the expected non-empty result. To try this, line 128 in LogicalToPhysicalOpConverter needs to be changed as follows.

        if ( fm instanceof SPARQLEndpoint ) return new PhysicalOpBindJoinWithVALUES(lop);

(i.e., PhysicalOpBindJoinWithFILTER needs to be replaced by PhysicalOpBindJoinWithVALUES)
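
To illustrate the difference between the three bind join variations, the sketch below builds the kind of request each one would send to the endpoint for the triple pattern `?gene owl:sameAs ?geneHttps`, given a batch of `?gene` bindings from the other side of the join. It only shows the general idea; the query strings that HeFQUIN actually generates may look different, and the function names are made up for this example.

```python
import re

PREFIXES = "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"


def with_filter(pattern: str, var: str, iris: list[str]) -> str:
    """FILTER-based: restrict the join variable with a disjunction."""
    conditions = " || ".join(f"?{var} = <{iri}>" for iri in iris)
    return f"{PREFIXES}SELECT * WHERE {{ {pattern} FILTER ({conditions}) }}"


def with_values(pattern: str, var: str, iris: list[str]) -> str:
    """VALUES-based: ship the bindings as inline data."""
    values = " ".join(f"<{iri}>" for iri in iris)
    return f"{PREFIXES}SELECT * WHERE {{ VALUES ?{var} {{ {values} }} {pattern} }}"


def with_union(pattern: str, var: str, iris: list[str]) -> str:
    """UNION-based: one branch per binding, with the variable substituted.

    Simplification: the substituted variable no longer appears in the
    result, so a real implementation has to restore it afterwards.
    """
    # \b keeps ?gene from also matching the prefix of ?geneHttps.
    occurrence = re.compile(rf"\?{re.escape(var)}\b")
    branches = " UNION ".join(
        "{ " + occurrence.sub(lambda _m, iri=iri: f"<{iri}>", pattern) + " }"
        for iri in iris
    )
    return f"{PREFIXES}SELECT * WHERE {{ {branches} }}"


pattern = "?gene owl:sameAs ?geneHttps ."
genes = [
    "http://identifiers.org/ensembl/ENSG00000006125",
    "http://identifiers.org/ensembl/ENSG00000100030",
]
for build in (with_filter, with_values, with_union):
    print(build(pattern, "gene", genes), end="\n\n")
```

An endpoint that ignores or rejects FILTER clauses can still answer the VALUES and UNION forms, which is consistent with the VALUES-based implementation producing a non-empty result here.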