RDFLib / sparqlwrapper

A wrapper for a remote SPARQL endpoint
https://sparqlwrapper.readthedocs.io/

HTTP Error 504 when returning results #136

Closed RdNetwork closed 3 years ago

RdNetwork commented 5 years ago

I am using SPARQLWrapper to update and manage data (big RDF graphs, the biggest having 130M+ triples) on a Virtuoso 7 server.

One of my processes requires a full graph copy (using COPY from the SPARQL 1.1 specification). My code works fine on smaller graphs, but on my biggest graph SPARQLWrapper returns an HTTP 504 error ("Gateway Timeout").

When I check in Virtuoso, the graph has actually finished copying (the new graph contains the right number of RDF triples), yet the error is raised anyway. My guess is therefore that the problem lies in how SPARQLWrapper fetches the results, rather than in an actual Virtuoso server error.
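(For reference, a count query along these lines is enough to check the triple counts — just a sketch, reusing the ENDPOINT and new_graph_name variables from the snippet below:)

import SPARQLWrapper

# Sketch: count the triples in the freshly copied graph to confirm the COPY completed.
sparql = SPARQLWrapper.SPARQLWrapper(ENDPOINT)
sparql.setQuery('SELECT (COUNT(*) AS ?n) WHERE { GRAPH <' + new_graph_name + '> { ?s ?p ?o } }')
sparql.setReturnFormat(SPARQLWrapper.JSON)
print sparql.query().convert()["results"]["bindings"][0]["n"]["value"]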

Here is the code (the explicit queryType assignment works around a bug I found some time ago; I am not sure whether it is still relevant):

import time
import SPARQLWrapper

sparql = SPARQLWrapper.SPARQLWrapper(ENDPOINT)
sparql.setTimeout(5000)  # generous client-side timeout, in seconds

[...some other, smaller queries that work fine...]

start = time.time()
print "Copying graph..."
sparql.setMethod(SPARQLWrapper.POST)
# "DEFINE sql:log-enable 2" is a Virtuoso pragma to speed up bulk updates,
# followed by a standard SPARQL 1.1 COPY.
sparql.setQuery('DEFINE sql:log-enable 2 COPY GRAPH <' + OLD_GRAPH + '> TO GRAPH <' + new_graph_name + '>')
sparql.queryType = SPARQLWrapper.SELECT  # workaround for the query-type detection bug mentioned above
sparql.query()
end = time.time()
print "\tDone! (Took " + str(end - start) + " seconds)"

Here is the output I get on my Python console:

Fetching the number of existing blank nodes in the original graph...
                Took 19.1267940998 seconds to count objects (0).
                Took 12.5638158321 seconds to count predicates (0).
                Took 12.2738580704 seconds to count subjects (0).
        Done! (Took 43.9792511463 seconds)
Copying graph...
Traceback (most recent call last):
  File "main.py", line 230, in <module>
    main()
  File "main.py", line 227, in main
    run_eval(NB_MUT_THREADS, NB_MUTATIONS,True,True)
  File "/home/ubuntu/safe-lod-anonymizer/exp.py", line 85, in run_eval
    sparql.query()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py", line 927, in query
    return QueryResult(self._query())
  File "/home/ubuntu/.local/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py", line 907, in _query
    raise e
urllib2.HTTPError: HTTP Error 504: Gateway Timeout

Yet both the Virtuoso server and the web server are still up and responding normally. Is there a way to prevent this?

Thanks in advance.

dayures commented 5 years ago

Hi @RdNetwork

Which version of SPARQLWrapper are you using?

Were you able to reproduce this issue outside SPARQLWrapper, for example with curl? Just to make sure the issue is inside SPARQLWrapper and not a server-side one.
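For instance, something like this plain urllib2 request would bypass SPARQLWrapper entirely (only a sketch — ENDPOINT, OLD_GRAPH and new_graph_name are the placeholders from your snippet, and the 'query' parameter mirrors what SPARQLWrapper sends here):

import urllib
import urllib2

# Sketch: send the same COPY update without SPARQLWrapper.
update = 'DEFINE sql:log-enable 2 COPY GRAPH <' + OLD_GRAPH + '> TO GRAPH <' + new_graph_name + '>'
request = urllib2.Request(ENDPOINT, urllib.urlencode({'query': update}))  # supplying data makes this a POST
try:
    print urllib2.urlopen(request, timeout=5000).getcode()
except urllib2.HTTPError as e:
    print "HTTP error:", e.code  # a 504 here too would point at the server/gateway, not SPARQLWrapper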

Thanks!

RdNetwork commented 5 years ago

Hi,

Sorry for the long delay... I had other priorities at the time and forgot to come back to this.

So, some additional information:

This looks like a timeout problem to me, but I can't pinpoint the exact root cause: my client-side timeout is set to 5000 seconds, so the 504 presumably comes from a gateway or proxy in front of Virtuoso rather than from the client.

dayures commented 5 years ago

Thanks for the feedback.

Did you try setting the timeout to a lower value (say, 5 seconds)? Does it behave as expected in that case?

RdNetwork commented 5 years ago

When I set the timeout to a lower value (e.g. setTimeout(5)), I get a different kind of timeout:

Traceback (most recent call last):
  File "main.py", line 230, in <module>
    main()
  File "main.py", line 227, in main
    run_eval(NB_MUT_THREADS, NB_MUTATIONS,True,True)
  File "/home/ubuntu/safe-lod-anonymizer/exp.py", line 87, in run_eval
    sparql.query()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py", line 927, in query
    return QueryResult(self._query())
  File "/home/ubuntu/.local/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py", line 891, in _query
    response = urlopener(request, timeout=self.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1121, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 438, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 394, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
socket.timeout: timed out

This one indeed fires after 5 seconds. So I suspect the original error doesn't come from a query timeout (i.e. all the results are fetched), but from something that happens afterwards (e.g. while the result set is built in Python).
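In the meantime, I may fall back on a workaround along these lines — just an untested sketch; it assumes the source graph's triple count is known, and reuses a count query like the one above:

import urllib2
import SPARQLWrapper

# Sketch: treat a 504 on COPY as "possibly finished" and verify by counting triples.
def copy_graph_with_check(sparql, old_graph, new_graph, expected_count):
    sparql.setMethod(SPARQLWrapper.POST)
    sparql.setQuery('COPY GRAPH <' + old_graph + '> TO GRAPH <' + new_graph + '>')
    sparql.queryType = SPARQLWrapper.SELECT  # same query-type workaround as above
    try:
        sparql.query()
    except urllib2.HTTPError as e:
        if e.code != 504:
            raise
        # The gateway gave up waiting, but Virtuoso may still have completed the copy.
    sparql.setMethod(SPARQLWrapper.GET)
    sparql.setQuery('SELECT (COUNT(*) AS ?n) WHERE { GRAPH <' + new_graph + '> { ?s ?p ?o } }')
    sparql.setReturnFormat(SPARQLWrapper.JSON)
    n = int(sparql.query().convert()["results"]["bindings"][0]["n"]["value"])
    # If the copy is still running server-side, the count may be temporarily lower;
    # polling with a delay would make this more robust.
    return n == expected_count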

dayures commented 4 years ago

Any update on this issue? I am not so sure the problem occurs while the result set is built in Python, since the HTTP error is raised on the server/gateway side.

dayures commented 3 years ago

Thanks for contributing to this issue. As it has been more than 90 days since the last activity, we are automatically closing the issue. This is often because the request was already solved in some way and it just wasn't updated or it's no longer applicable. If that's not the case, please do feel free to either reopen this issue or open a new one. We'll gladly take a look again!