RDFLib / sparqlwrapper

A wrapper for a remote SPARQL endpoint
https://sparqlwrapper.readthedocs.io/

HTTP Error 403: Forbidden #139

Closed bngksgl closed 3 years ago

bngksgl commented 5 years ago

Hi,

I am trying to use SPARQLWrapper in Python to query Wikidata. Last week my code was working without a problem, but today I am receiving an 'HTTPError: HTTP Error 403: Forbidden' error. I also tried using requests and still get the same error. How can I overcome this issue? Below you can find the code I am using and the error.

Code:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

agent_ = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent=agent_)
sparql.setQuery("""SELECT * { SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
{ SELECT ?business ?businessLabel ?altLabel WHERE {
  # an entity has a key ID and a usage count
  ?item wdt:P31 wd:Q327333.
  OPTIONAL { ?business skos:altLabel ?altLabel . FILTER (lang(?altLabel) = "en") }
}
} UNION { SELECT ?business ?businessLabel ?altLabel WHERE {
  # an entity has a key ID and a usage count
  ?item wdt:P31 wd:Q20658380.
  OPTIONAL { ?business skos:altLabel ?altLabel . FILTER (lang(?altLabel) = "en") }
}
}
}""")
sparql.setReturnFormat(JSON)
data = sparql.query().convert()
```

Error:

```
HTTPError                                 Traceback (most recent call last)
<ipython-input> in <module>
     19 }""")
     20 sparql.setReturnFormat(JSON)
---> 21 data = sparql.query().convert()

~\AppData\Roaming\Python\Python37\site-packages\SPARQLWrapper\Wrapper.py in query(self)
    925         @rtype: L{QueryResult} instance
    926         """
--> 927         return QueryResult(self._query())
    928
    929     def queryAndConvert(self):

~\AppData\Roaming\Python\Python37\site-packages\SPARQLWrapper\Wrapper.py in _query(self)
    905                 raise EndPointInternalError(e.read())
    906             else:
--> 907                 raise e
    908
    909     def query(self):

~\AppData\Roaming\Python\Python37\site-packages\SPARQLWrapper\Wrapper.py in _query(self)
    891                 response = urlopener(request, timeout=self.timeout)
    892             else:
--> 893                 response = urlopener(request)
    894             return response, self.returnFormat
    895         except urllib.error.HTTPError as e:

~\AppData\Local\Continuum\anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~\AppData\Local\Continuum\anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

~\AppData\Local\Continuum\anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642
    643         return response

~\AppData\Local\Continuum\anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570
    571 # XXX probably also want an abstract factory that knows when it makes

~\AppData\Local\Continuum\anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\AppData\Local\Continuum\anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden
```
dayures commented 5 years ago

Hi @bngksgl I asked some people involved in @wikidata and it looks like there is an issue with the SPARQL endpoint. Hopefully it will be solved soon.

bngksgl commented 5 years ago

Hi, thank you for the feedback @dayures. I just checked today as well and it's still not working. Did you hear back from the people at @Wikidata?

dayures commented 5 years ago

Hi @bngksgl I haven't received any feedback. I tested the query today and it returns a timeout error (not a 403).

BTW, could you double-check your query? Maybe it could be re-organized in a way that doesn't overload the server. Also, "?item" appears in the query. Is that intended, or should it be "?business"?

For instance, this query (which is part of the query you shared) already returns over 11K results.

https://query.wikidata.org/#%23Goats%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FaltLabel%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ327333.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%20%20%0A%20%20OPTIONAL%20%7B%20%3Fitem%20skos%3AaltLabel%20%3FaltLabel%20.%20FILTER%20%28lang%28%3FaltLabel%29%20%3D%20%22en%22%29%20%7D%0A%0A%7D
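For readability, the URL above decodes to the following query:

```
#Goats
SELECT ?item ?itemLabel ?altLabel
WHERE
{
  ?item wdt:P31 wd:Q327333.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) = "en") }
}
```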

chicocvenancio commented 5 years ago

Maybe we need to set the User-Agent? It seems the sparqlwrapper User-Agent may be blocked on their end. https://lists.wikimedia.org/pipermail/wikidata/2019-July/013247.html

ookgezellig commented 5 years ago

See also https://phabricator.wikimedia.org/T230135

dayures commented 5 years ago

@chicocvenancio @ookgezellig thanks for the feedback! Do you know if it is possible to see the list of blacklisted user agents somehow?

bngksgl commented 5 years ago

@dayures thanks for your comment! I changed my IP address, and it seems to work now. I think my IP address was blocked by their servers due to excessive querying, so I reworked the query to put less load on the servers.

lioneltrebuchon commented 4 years ago

1. If one cannot easily change IP address, does anyone know how long it takes for the IP to be removed from the blacklist?
2. How do you get onto this blacklist? Is it the number of requests, or the number of returned rows?

lucaswerkmeister commented 4 years ago

You shouldn't change your IP address at all. Rate limits are per client, which is defined as IP address + user agent, so what you should do is set a good user agent in accordance with the User-Agent policy. The limits are also explained here – basically, you get 60 seconds of query runtime per 60 seconds of real time. (In other words, you can briefly run queries in parallel, but not continuously.) If you get an HTTP 429 Too Many Requests error from the server, stop sending queries altogether until the time specified in the Retry-After response header; if you fail to do that, your client will be banned for 24 hours.

The amount of data returned has no impact, as far as I’m aware, at least as long as you don’t cause timeout errors. (If you do cause errors, there’s a limit of 30 errors per minute.) And I’m not aware of any ban longer than these 24 hours.


Edit: To set the User-Agent, pass the agent parameter into the SPARQLWrapper constructor, for example:

```python
from SPARQLWrapper import SPARQLWrapper
wrapper = SPARQLWrapper('https://query.wikidata.org/sparql',
                        agent='example-UA (https://example.com/; mail@example.com)')
```

(Note that SPARQLWrapper2 is missing this parameter, see #162.)

dayures commented 3 years ago

Thanks for contributing to this issue. As it has been more than 90 days since the last activity, we are automatically closing the issue. This is often because the request was already solved in some way and it just wasn't updated or it's no longer applicable. If that's not the case, please do feel free to either reopen this issue or open a new one. We'll gladly take a look again!