RDFLib / sparqlwrapper

A wrapper for a remote SPARQL endpoint
https://sparqlwrapper.readthedocs.io/

unknown response content type on update with Apache Jena Fuseki #159

Closed WolfgangFahl closed 5 months ago

WolfgangFahl commented 3 years ago

I get the error message:

/Users/wf/Library/Python/3.8/lib/python/site-packages/SPARQLWrapper/Wrapper.py:1346: RuntimeWarning: unknown response content type 'text/html;charset=utf-8' returning raw response...
  warnings.warn("unknown response content type '%s' returning raw response..." %(ct), RuntimeWarning)
b'<html>\n<head>\n</head>\n<body>\n<h1>Success</h1>\n<p>\nUpdate succeeded\n</p>\n</body>\n</html>\n'

when running the unit test below. I found

neither of which explains the reason for the problem or how to work around it. I assume the content type for Apache Jena must be different. How could I set it?

python unit test

def testJenaInsert(self):
        jena=self.getJena(mode="update")
        insertString = """
        PREFIX cr: <http://cr.bitplan.com/>
        INSERT DATA { 
          cr:version cr:author "Wolfgang Fahl". 
        }
        """
        results=jena.rawQuery(insertString)
        print (results)

jena.py helper module

'''
Created on 2020-08-14

@author: wf
'''
from SPARQLWrapper import SPARQLWrapper, JSON

class Jena(object):
    '''
    wrapper for Apache Jena
    '''

    def __init__(self,url,returnFormat=JSON):
        '''
        Constructor
        '''
        self.url=url
        self.sparql=SPARQLWrapper(url,returnFormat=returnFormat)

    def rawQuery(self,queryString,method='POST'):
        '''
        query with the given query string
        '''
        self.sparql.setQuery(queryString)
        self.sparql.method=method
        queryResult = self.sparql.query()
        jsonResult=queryResult.convert()
        return jsonResult 

    def getResults(self,jsonResult):
        '''
        get the result from the given jsonResult
        '''
        return jsonResult["results"]["bindings"]

    def query(self,queryString,method='POST'):
        '''
        get a list of results for the given query
        '''
        jsonResult=self.rawQuery(queryString,method=method)
        return self.getResults(jsonResult)

dayures commented 3 years ago

Maybe this issue arises because the result of a SPARQL Update query (in SPARQLWrapper) is not a result that can be converted/parsed, due to the nature of the specification of the SPARQL protocol.

For SPARQL Query (SELECT, ASK, DESCRIBE, CONSTRUCT),

The response body of a successful query operation with a 2XX response is either:

a SPARQL Results Document in XML, JSON, or CSV/TSV format (for SPARQL Query forms SELECT and ASK); or, an RDF graph [RDF-CONCEPTS] serialized, for example, in the RDF/XML syntax [RDF-XML], or an equivalent RDF graph serialization, for SPARQL Query forms DESCRIBE and CONSTRUCT). https://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/#query-success

And in the case of SPARQL Update,

The response body of a successful update request is implementation defined. Implementations may use HTTP content negotiation to provide both human-readable and machine-processable information about the completed update request. https://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/#update-success

So you will need to post-process the response of a SPARQL query request differently from the response of a SPARQL update request.

You can see an example of SPARQL Update request in the documentation. https://sparqlwrapper.readthedocs.io/en/stable/main.html#sparql-update-example
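
The distinction above can be sketched in a few lines. This is only an illustration under assumptions, not SPARQLWrapper code: `handle_response` and its parameters are hypothetical names, and the query branch assumes the endpoint answered a SELECT in the SPARQL JSON results format.

```python
import json


def handle_response(is_update, status, body):
    """Post-process an HTTP response from a SPARQL endpoint (hypothetical helper)."""
    if not 200 <= status < 300:
        raise RuntimeError("request failed with HTTP status %d" % status)
    if is_update:
        # The body of a successful update is implementation defined,
        # so only the status code is used and the body is not parsed.
        return None
    # A SELECT result in SPARQL JSON format keeps its rows under results.bindings.
    return json.loads(body)["results"]["bindings"]


# An HTML success page returned for an update is simply not parsed.
print(handle_response(True, 200, b"<html><h1>Success</h1></html>"))
# A SELECT response is parsed into a list of bindings.
select_body = b'{"head": {"vars": ["s"]}, "results": {"bindings": []}}'
print(handle_response(False, 200, select_body))
```

The point of the split is that an update never reaches the parsing step, so an unexpected content type such as text/html cannot cause a conversion problem there.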

WolfgangFahl commented 3 years ago

See also http://wiki.bitplan.com/index.php/DgraphAndWeaviateTest - I am sending an e-mail to the Apache users list now to get more info.

afs commented 3 years ago

text/html;charset=utf-8 is not a Fuseki response from the SPARQL engine. Was the endpoint an HTML page?

WolfgangFahl commented 3 years ago

@afs thank you for looking into this. I followed the procedure in http://wiki.bitplan.com/index.php/DgraphAndWeaviateTest#Apache_Jena

scripts/jena -l sampledata/example.ttl
scripts/jena -f example

which will install Apache Jena, load the sample data and fire up the Fuseki server.

apache-jena-fuseki-3.16.0
apache-jena-3.16.0
start loading sampledata/example.ttl to /Users/wf/Documents/py-workspace/DgraphAndWeaviateTest/data at 2020-08-15T14:41:24Z
finished loading sampledata/example.ttl to /Users/wf/Documents/py-workspace/DgraphAndWeaviateTest/data at 2020-08-15T14:41:26Z
16:41:26 INFO  loader          :: Loader = LoaderPhased
16:41:26 INFO  loader          :: Start: sampledata/example.ttl
16:41:26 INFO  loader          :: Finished: sampledata/example.ttl: 20 tuples in 0.07s (Avg: 298)
16:41:26 INFO  loader          :: Finish - index SPO
16:41:26 INFO  loader          :: Start replay index SPO
16:41:26 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP
16:41:26 INFO  loader          :: Index set:  SPO => SPO->POS, SPO->OSP [20 items, 0.0 seconds]
16:41:26 INFO  loader          :: Finish - index OSP
16:41:26 INFO  loader          :: Finish - index POS

scripts/jena -f example
apache-jena-fuseki-3.16.0
apache-jena-3.16.0
starting fuseki server
16:41:52 INFO  Server          :: Running in read-only mode for /example
16:41:52 INFO  Server          :: Apache Jena Fuseki 3.16.0
16:41:52 INFO  Config          :: FUSEKI_HOME=/Users/wf/Documents/pyworkspace/DgraphAndWeaviateTest/lib/apache-jena-fuseki-3.16.0/.
16:41:52 INFO  Config          :: FUSEKI_BASE=/Users/wf/Documents/pyworkspace/DgraphAndWeaviateTest/lib/apache-jena-fuseki-3.16.0/run
16:41:52 INFO  Config          :: Shiro file: file:///Users/wf/Documents/pyworkspace/DgraphAndWeaviateTest/lib/apache-jena-fuseki-3.16.0/run/shiro.ini
16:41:52 INFO  Config          :: Template file: templates/config-tdb2-dir
16:41:52 INFO  Config          :: TDB dataset: directory=/Users/wf/Documents/py-workspace/DgraphAndWeaviateTest/data

I took the endpoint info from http://localhost:3030/dataset.html which states that SPARQL Query would be at: /example/query and SPARQL Update at /example/update

Thus in the unit test:

def getJena(self,mode='query'):
        endpoint="http://localhost:3030/example/%s" % mode
        jena=Jena(endpoint)
        return jena

afs commented 3 years ago

text/html;charset=utf-8 comes from somewhere but I can't make it happen with Fuseki 3.16.0.

Even static HTML pages come back Content-Type: text/html -- no charset.

What does the Fuseki server log file contain? If there are no entries for the request, then the request didn't reach Fuseki.

WolfgangFahl commented 3 years ago

@afs The log has:

18:57:10 INFO  Fuseki          :: [60] POST http://localhost:3030/example/update
18:57:10 INFO  Fuseki          :: [60] 200 OK (7 ms)

I tried a curl request

curl --data-binary @insert.txt http://localhost:3030/example/update
Error 400: Bad Request

with insert.txt=

PREFIX cr: <http://cr.bitplan.com/>
        INSERT DATA { 
          cr:version cr:author "Wolfgang Fahl". 
        }

but I get an Error 400: Bad Request.

I tried debugging the HTTP call in my Python IDE, but due to all the abstraction layers it's pretty hard.

full_url is: str: http://localhost:3030/example/update
'Content-type' (4341599984) str: application/x-www-form-urlencoded
'User-agent' (4341609136) str: sparqlwrapper 1.8.5 (rdflib.github.io/sparqlwrapper)
'Accept' (4341610160) str: application/sparql-results+json,application/json,text/javascript,application/javascript

data: bytes: b'update=%0A++++++++PREFIX+cr%3A+%3Chttp%3A//cr.bitplan.com/%3E%0A++++++++INSERT+DATA+%7B+%0A++++++++++cr%3Aversion+cr%3Aauthor+%22Wolfgang+Fahl%22.+%0A++++++++%7D%0A++++++++'
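
The bytes above are an HTML-form body with a single `update` field. As a small sketch using only the Python standard library (the variable names are illustrative), decoding it recovers the original update text:

```python
from urllib.parse import parse_qs

# Form-encoded body copied from the debugger output above.
data = (
    b"update=%0A++++++++PREFIX+cr%3A+%3Chttp%3A//cr.bitplan.com/%3E"
    b"%0A++++++++INSERT+DATA+%7B+%0A++++++++++cr%3Aversion+cr%3Aauthor+"
    b"%22Wolfgang+Fahl%22.+%0A++++++++%7D%0A++++++++"
)

# parse_qs splits the body into fields and undoes the percent/plus encoding.
fields = parse_qs(data.decode("ascii"))
update_text = fields["update"][0]
print(update_text)
```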

The raw response is: headers HTTPMessage: Connection: close\nDate: Sat, 15 Aug 2020 17:06:28 GMT\nFuseki-Request-ID: 63\nContent-Type: text/html;charset=utf-8\n\n

The SPARQLWrapper code then expects a content-type of XML, JSON, RDF/XML, N3, CSV, JSON-LD and since none of these is found a warning is issued (unfortunately unconditionally - since the call is ok it would be sufficient to ignore the problem or just look for the html content having the success message).
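
One way to silence just this warning on the caller's side is the standard `warnings` module. The sketch below is an assumption-laden illustration: `call_ignoring_content_type_warning` and `fake_update` are hypothetical names, and `fake_update` merely imitates the warning text quoted at the top of this issue instead of calling SPARQLWrapper.

```python
import warnings


def call_ignoring_content_type_warning(func, *args, **kwargs):
    """Call func while hiding only the 'unknown response content type' warning."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore",
            message="unknown response content type",
            category=RuntimeWarning,
        )
        return func(*args, **kwargs)


def fake_update():
    # Stand-in for the real call; it emits the same warning text as in the report.
    warnings.warn(
        "unknown response content type 'text/html;charset=utf-8' returning raw response...",
        RuntimeWarning,
    )
    return b"<html>...</html>"


print(call_ignoring_content_type_warning(fake_update))
```

Other warnings still pass through, because the filter is restricted to one category and one message prefix and is removed again when the `with` block ends.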

afs commented 3 years ago

re: curl --data-binary @insert.txt -- this sends the content in the body -- you need to set the content type.

curl -v -g --header 'Content-type: application/sparql-update' --data-binary 'INSERT DATA{}' http://localhost:3030/example/update

else it is application/x-www-form-urlencoded, in which case you need "update=" in insert.txt.

curl -v -g -d'update=INSERT DATA{}' http://localhost:3030/example/update


The Fuseki response to application/x-www-form-urlencoded is an HTML page -- i.e. something displayable -- which is reasonable because it was sent an HTML form.

If the expected content is an RDF format, then it looks like the client code is processing it more like a query.

28 suggests there is now a way to ask SPARQLWrapper to use POST and Content-type: application/sparql-update, with the body holding the update request in UTF-8, by setting the Content-type. This is the better way - HTML forms have size limitations in practice.

For an update response - only the status code is needed - an application can ignore the response body (but it must consume the bytes to preserve connection caching). The "readthedocs" reference looks right.
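
The two request shapes from the curl commands above can be sketched in Python with the standard library alone. This is an illustration under assumptions, not what SPARQLWrapper does internally: the two function names are hypothetical, and the requests are only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:3030/example/update"  # endpoint used in this thread
UPDATE = "INSERT DATA{}"


def direct_update_request(endpoint, update):
    """POST with the update text itself as the UTF-8 body."""
    return Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


def form_update_request(endpoint, update):
    """POST as an HTML form, with the update text in the 'update' field."""
    return Request(
        endpoint,
        data=urlencode({"update": update}).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


for req in (direct_update_request(ENDPOINT, UPDATE), form_update_request(ENDPOINT, UPDATE)):
    # urllib normalizes header names, hence the lower-case "type" in the lookup.
    print(req.get_method(), req.get_header("Content-type"), req.data)
```

Sending either request would be a matter of passing it to `urllib.request.urlopen` and reading the response body to the end; only the status code matters for an update.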

WolfgangFahl commented 3 years ago

@afs, @dayures thank you for your effort, which led to finding out how to do things:

self.sparql.setRequestMethod(POSTDIRECTLY)

is the key to properly handling updates. The documentation might want to more prominently point this out. E.g. there is no example in the scripts directory showing the usage.

'''
Created on 2020-08-14

@author: wf
'''
from SPARQLWrapper import SPARQLWrapper, JSON
from SPARQLWrapper.Wrapper import POSTDIRECTLY, POST

class Jena(object):
    '''
    wrapper for Apache Jena
    '''

    def __init__(self,url,mode='query',returnFormat=JSON):
        '''
        Constructor
        '''
        # keep the endpoint URL as passed in; the wrapper below is built from the same url
        self.url=url
        self.mode=mode
        self.sparql=SPARQLWrapper(url,returnFormat=returnFormat)

    def rawQuery(self,queryString,method='POST'):
        '''
        query with the given query string
        '''
        self.sparql.setQuery(queryString)
        self.sparql.method=method
        queryResult = self.sparql.query()
        return queryResult 

    def getResults(self,jsonResult):
        '''
        get the result from the given jsonResult
        '''
        return jsonResult["results"]["bindings"]

    def insert(self,insertCommand):
        '''
        run an insert
        '''
        self.sparql.setRequestMethod(POSTDIRECTLY)
        response=self.rawQuery(insertCommand, method=POST)
        return response

    def query(self,queryString,method=POST):
        '''
        get a list of results for the given query
        '''
        queryResult=self.rawQuery(queryString,method=method) 
        jsonResult=queryResult.convert()
        return self.getResults(jsonResult)

afs commented 3 years ago

It would be better if the update read the response body and threw it away.

For example, it may be a parse error and there is an error message but even for a zero length body, it is better to read it and hit the end of stream.

For any HTTP usage, if the caller does not read all of the response body, the connection cannot be reused for another request because to the HTTP code it looks like it is still in use. For a few requests this may not matter in the client, though it is unhelpful in the server and may impact other clients. It is slower to open a TCP connection for every request.

(This is not specific to the SPARQL protocol - it applies to all HTTP usage.)
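
A minimal sketch of "read the response body and throw it away" follows. `drain` is a hypothetical helper name; it works on any object with a `read(size)` method, and an in-memory stream stands in for a real HTTP response here.

```python
import io


def drain(response, chunk_size=8192):
    """Read and discard the rest of a response body; return the number of bytes discarded."""
    discarded = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            # An empty read means the end of the stream was reached.
            break
        discarded += len(chunk)
    return discarded


# Stand-in for an HTTP response body.
body = io.BytesIO(b"<html>ok</html>")
print(drain(body))
```

With SPARQLWrapper, the object to drain would presumably be the `response` attribute of the value returned by `query()`, as in the `results.response.read()` call from the documentation example; that mapping is an assumption and not verified here.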

WolfgangFahl commented 3 years ago

@afs - I think I get an empty response in case of success and an exception in case of error. That's what http://wiki.bitplan.com/index.php/DgraphAndWeaviateTest#Apache_unit_test now tests. How come you assume the body is not read?

afs commented 3 years ago

Because I can't see a line that does it (not that I know SPARQLWrapper, but there was an SO question a while back that came down to holding connections open, and eventually the server ran out of serving threads).

    response=self.rawQuery(insertCommand, method=POST)
    return response

"response" is "queryResult" -- the document has results.response.read() (different 'response').

This will not show up in a unit test. I don't know if reading the header and status code also causes reading the whole of the body (which would be non-streaming).

WolfgangFahl commented 3 years ago

@afs - thanks for the hint - I added a dummy line. Now I am stuck at https://stackoverflow.com/questions/63435157/listofdict-to-rdf-conversion-in-python-targeting-apache-jena-fuseki

WolfgangFahl commented 3 years ago

@afs @dayures The result of all this is an extension of the SPARQLWrapper at https://github.com/WolfgangFahl/DgraphAndWeaviateTest - for this issue only the documentation part is open. I am going to open a new issue regarding the ListOfDict conversion.

WolfgangFahl commented 5 months ago

With Jena 4.9.0 and SPARQLWrapper 2.0 I now get Exception: HTTP Error 415: Unsupported Media Type

afs commented 5 months ago

Jena does not return the message "Unsupported Media Type". The 415 cases have a different message.

"Unsupported Media Type" is the generic error message so it is not clear the operation is going to Fuseki at all.

Check the server log.

If the operation gets there, it is logged with an error message. You can also run it with "-v" to get a detailed HTTP request log.

In case it is calling Fuseki and the error message wasn't available (this happens with HTTP/2 - the server log has the correct error message in it):

If it is to an update-specific endpoint (.../update), there is a Content-type but it's not right for update. The correct MIME type is "application/sparql-update", or an HTML form that includes "update=" (see above https://github.com/RDFLib/sparqlwrapper/issues/159#issuecomment-674427602).

WolfgangFahl commented 5 months ago

@afs Thank you for the swift response! Jena 4.10 needs a --update on start and has different endpoints for update and query. Thanks to Tim Holzheim for finding these details and changing our test setup accordingly.