RDFLib / sparqlwrapper

A wrapper for a remote SPARQL endpoint
https://sparqlwrapper.readthedocs.io/

SPARQLWrapper does not work for `CONSTRUCT` and `DESCRIBE` queries on the UniProt SPARQL endpoint which is Virtuoso #234

Open vemonet opened 4 weeks ago

vemonet commented 4 weeks ago

When running any CONSTRUCT or DESCRIBE query against the UniProt SPARQL endpoint (https://sparql.uniprot.org/sparql/), SPARQLWrapper fails to return the query results, whatever return format is requested (XML or Turtle).

Code to reproduce:

When asking for XML at least an error is thrown:

from SPARQLWrapper import TURTLE, XML, SPARQLWrapper

query = """PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
CONSTRUCT
{
    ?protein a up:HumanProtein .
}
WHERE
{
    ?protein a up:Protein .
    ?protein up:organism taxon:9606
} LIMIT 10"""

sparql_endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql/")
sparql_endpoint.setReturnFormat(XML)
sparql_endpoint.setQuery(query)

results = sparql_endpoint.query().convert()
print(results)

Error message:

ExpatError                                Traceback (most recent call last)
Cell In[8], line 20
     17 # sparql_endpoint.setReturnFormat(TURTLE)
     18 sparql_endpoint.setQuery(query)
---> 20 results = sparql_endpoint.query().convert()
     21 print(results)

File ~/dev/.venv/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py:1190, in QueryResult.convert(self)
   1188 if _content_type_in_list(ct, _SPARQL_XML):
   1189     _validate_format("XML", [XML], ct, self.requestedFormat)
-> 1190     return self._convertXML()
   1191 elif _content_type_in_list(ct, _XML):
   1192     _validate_format("XML", [XML], ct, self.requestedFormat)

File ~/dev/.venv/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py:1073, in QueryResult._convertXML(self)
   1065 def _convertXML(self) -> Document:
   1066     """
   1067     Convert an XML result into a Python dom tree. This method can be overwritten in a
   1068     subclass for a different conversion method.
   (...)
   1071     :rtype: :class:`xml.dom.minidom.Document`
   1072     """
-> 1073     doc = parse(self.response)
   1074     rdoc = cast(Document, doc)
...
--> 211     parser.Parse(b"", True)
    212 except ParseEscape:
    213     pass

ExpatError: no element found: line 1, column 0

When asking for Turtle, SPARQLWrapper does not even raise an error:

from SPARQLWrapper import TURTLE, XML, SPARQLWrapper

query = """PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
CONSTRUCT
{
    ?protein a up:HumanProtein .
}
WHERE
{
    ?protein a up:Protein .
    ?protein up:organism taxon:9606
} LIMIT 10"""

sparql_endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql/")
# sparql_endpoint.setReturnFormat(XML)
sparql_endpoint.setReturnFormat(TURTLE)
sparql_endpoint.setQuery(query)

results = sparql_endpoint.query().convert()
print(results)

Printing results gives HTML: b'<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>UniProt</title>......

UniProt uses OpenLink Virtuoso and supports the SPARQL 1.1 Standard.
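
Since the Turtle case returns an HTML page without any error, a small guard on the response's Content-Type makes that failure visible. This is only a sketch: `ensure_rdf_response` and `RDF_MEDIA_TYPES` are hypothetical names, not part of SPARQLWrapper, and the set of accepted media types is an assumption to adjust per endpoint.

```python
# Hypothetical helper, not part of SPARQLWrapper: fail loudly when an endpoint
# answers a CONSTRUCT/DESCRIBE query with something that is not RDF.

# Assumed list of acceptable RDF media types; extend as needed.
RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
}


def ensure_rdf_response(content_type: str, body: bytes) -> bytes:
    """Return body unchanged if content_type is an RDF media type, else raise."""
    # Strip parameters such as "; charset=UTF-8" and normalise the case.
    bare = content_type.split(";", 1)[0].strip().lower()
    if bare not in RDF_MEDIA_TYPES:
        preview = body[:80].decode("utf-8", errors="replace")
        raise ValueError(f"Expected an RDF media type, got {bare!r}: {preview!r}")
    return body
```

With requests, this would be called as `ensure_rdf_response(response.headers["Content-Type"], response.content)` before handing the data to a parser.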

vemonet commented 4 weeks ago

Using requests with a straightforward configuration for querying a SPARQL endpoint just works, so the problem appears to be in what SPARQLWrapper does internally:

import requests
from rdflib import Graph

query = """PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
CONSTRUCT
{
    ?protein a up:HumanProtein .
}
WHERE
{
    ?protein a up:Protein .
    ?protein up:organism taxon:9606
} LIMIT 10"""

response = requests.post(
    "https://sparql.uniprot.org/sparql/",
    headers={
        "Accept": "text/turtle"
    },
    data={
        "query": query
    },
    timeout=60,
)
response.raise_for_status()
g = Graph()
g.parse(data=response.text, format="turtle")

print(response.text)
print(len(g))

As a bonus, basic features like the timeout work! (The .setTimeout() option of SPARQLWrapper does not work at all, at least for the UniProt endpoint, but that belongs in a separate issue.)

JervenBolleman commented 4 weeks ago

UniProt is not pure Virtuoso: it has some middleware that expects the Accept header to ask for an RDF format when using DESCRIBE and/or CONSTRUCT.
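
A minimal sketch of what that could look like on the client side, using only the standard library: pick the Accept header from the query form, so CONSTRUCT and DESCRIBE ask for an RDF format while SELECT and ASK ask for a results format. The names `accept_for` and `build_request` are hypothetical, the two media types are assumptions, and the query-form detection is deliberately naive.

```python
import re
import urllib.parse
import urllib.request

# Assumed media types; adjust to what the endpoint actually supports.
GRAPH_ACCEPT = "text/turtle"
RESULTS_ACCEPT = "application/sparql-results+json"

_QUERY_FORM = re.compile(r"\b(SELECT|CONSTRUCT|DESCRIBE|ASK)\b", re.IGNORECASE)


def accept_for(query: str) -> str:
    """Choose an Accept header value from the first query-form keyword found."""
    # Naive detection: blank out IRIs in angle brackets first, so a keyword
    # inside a PREFIX IRI is not mistaken for the query form.
    without_iris = re.sub(r"<[^>]*>", " ", query)
    match = _QUERY_FORM.search(without_iris)
    if match is None:
        raise ValueError("No SELECT, CONSTRUCT, DESCRIBE or ASK found in the query")
    form = match.group(1).upper()
    return GRAPH_ACCEPT if form in ("CONSTRUCT", "DESCRIBE") else RESULTS_ACCEPT


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST with an explicit Accept header."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": accept_for(query),
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req, timeout=60)`. SPARQLWrapper also has `addCustomHttpHeader("Accept", "text/turtle")`, which may allow forcing the same header from within the library; that is untested against UniProt here.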

vemonet commented 1 week ago

@JervenBolleman SPARQLWrapper also fails to run SELECT queries against SwissLipids (https://beta.sparql.swisslipids.org/).

The endpoint answers with Error 500 Internal Server Error ("The server was not able to handle your request."):

from SPARQLWrapper import XML, SPARQLWrapper, JSON

query = """PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?comment ?query
WHERE
{
    ?sq a sh:SPARQLExecutable ;
        rdfs:label|rdfs:comment ?comment ;
        sh:select|sh:ask|sh:construct|sh:describe ?query .
}"""

sparql_endpoint = SPARQLWrapper("https://beta.sparql.swisslipids.org/")
sparql_endpoint.setReturnFormat(XML)
sparql_endpoint.setTimeout(60)
sparql_endpoint.setQuery(query)

results = sparql_endpoint.query().convert()
print(results)

With requests it works:

import requests

query = """PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?comment ?query
WHERE
{
    ?sq a sh:SPARQLExecutable ;
        rdfs:label|rdfs:comment ?comment ;
        sh:select|sh:ask|sh:construct|sh:describe ?query .
}"""

response = requests.post(
    "https://beta.sparql.swisslipids.org/",
    headers={
        "Accept": "application/json",
        "User-agent": "sparqlwrapper 2.0.1a0 (rdflib.github.io/sparqlwrapper)"
    },
    data={
        "query": query
    },
    timeout=60,
)
try:
    response.raise_for_status()
    print(response.json())
except requests.exceptions.HTTPError as e:
    print(e)
    print(response.text)
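
For completeness, a small sketch of turning the JSON returned by a SELECT query into plain rows, assuming the endpoint answers in the standard SPARQL 1.1 Query Results JSON layout (`head.vars` and `results.bindings`). The function name `rows_from_sparql_json` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def rows_from_sparql_json(payload: Dict[str, Any]) -> List[Dict[str, Optional[str]]]:
    """Flatten SPARQL JSON results into one dict per solution, keyed by variable."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # A variable left unbound in a solution is absent from the binding,
        # so it is reported as None here.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows
```

With the snippet above, this would be called as `rows_from_sparql_json(response.json())`, giving one dict with the keys `comment` and `query` per result.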