Closed nklsbckmnn closed 1 year ago
import urllib.request
url = "http://purl.obolibrary.org/obo/hp.owl"
response = urllib.request.urlopen(url)
Is this behaviour new?
This works:
from urllib.request import Request, urlopen
url = "https://purl.obolibrary.org/obo/hp.owl"
# Send a browser-like User-Agent instead of urllib's default one
request_site = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(request_site)
But I am wondering if the recent changes to the PURL system now cause this:
https://www.pythonpool.com/urllib-error-httperror-http-error-403-forbidden
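For context, a minimal sketch of what differs between the failing and the working call: by default, urllib identifies itself with a `Python-urllib/<version>` User-Agent, while the workaround sets the header explicitly. Nothing here touches the network; the `Request` is only constructed, not sent.

```python
import urllib.request

# The default opener used by urlopen() carries a "User-agent" header
# of the form "Python-urllib/<major>.<minor>".
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]
print(default_user_agent)

# Build a Request with an explicit header, as in the workaround above.
# No network call happens until urlopen() is invoked on it.
req = urllib.request.Request(
    "https://purl.obolibrary.org/obo/hp.owl",
    headers={"User-Agent": "Mozilla/5.0"},
)
# Request normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Whether the default User-Agent is what triggers the 403 is an assumption at this point in the thread; the sketch only shows where the two requests differ.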
The wheregoes trace works:
Yes, it's new. I think it still worked on Friday. Maybe GitHub is blocking a user agent it suspects of abuse?
On Friday we changed something in our PURL config, cc @kltm, but it is a bit odd that every tool other than urllib.request works: wget, curl, wheregoes.
Thanks for the report!
EDIT: Posted too soon, the following is incorrect.
~I think the problem is GitHub, not the PURL server. The PURL server redirects http://purl.obolibrary.org/obo/hp.owl to https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl (https://github.com/OBOFoundry/purl.obolibrary.org/blob/master/config/hp.yml#LL9C11-L9C99). This code gives me a 403:~
import urllib.request
url = "https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl"
response = urllib.request.urlopen(url)
Normal requests also works:
import requests
r = requests.get('http://purl.obolibrary.org/obo/omo.owl', allow_redirects=True)
# Use a context manager so the file is closed after writing
with open('omo.owl', 'wb') as f:
    f.write(r.content)
Using @eliasweatherfield's code in a try/except also works:
import urllib.request
url = "http://purl.obolibrary.org/obo/omo.owl"
try:
    response = urllib.request.urlopen(url)
except:
    print("Ignore this error")
print(response.read(100))
This suggests that the request is successful, but the error is thrown regardless.
I can confirm the 403 described by @eliasweatherfield in Python 3.9 and 3.11. I think @matentzn is seeing an old response object, because I get a NameError: name 'response' is not defined error from the final line print(response.read(100)).
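To illustrate why the bare except hides the failure rather than proving success, here is a small offline sketch. `fake_urlopen` is a hypothetical stand-in for `urllib.request.urlopen` that always raises a 403, so the example runs without network access.

```python
import io
import urllib.error


def fake_urlopen(url):
    # Hypothetical stand-in for urllib.request.urlopen: always fails with 403.
    raise urllib.error.HTTPError(url, 403, "Forbidden", None, io.BytesIO(b""))


def fetch_with_bare_except(url):
    try:
        response = fake_urlopen(url)
    except:  # noqa: E722 (mirrors the snippet under discussion)
        print("Ignore this error")
    # 'response' was never assigned because the call raised before assignment.
    return response


raised_name_error = False
try:
    fetch_with_bare_except("http://purl.obolibrary.org/obo/omo.owl")
except NameError:
    # Inside a function this surfaces as UnboundLocalError, a subclass of
    # NameError; at module level it is the plain NameError reported above.
    raised_name_error = True
print(raised_name_error)
```

In a fresh interpreter the name is unbound, so the final read fails. In a notebook or REPL session where `response` was assigned earlier, the old object would be read instead, which is consistent with the explanation above.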
Ok, now I think that Cloudflare is rejecting the request, which makes sense given the timing of this issue:
import urllib.error
import urllib.request
url = "http://purl.obolibrary.org/obo/hp.owl"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # Print the error details, including the response headers
    print(e)
    print(e.code)
    print(e.reason)
    print(e.headers)
HTTP Error 403: Forbidden
403
Forbidden
Date: Mon, 05 Jun 2023 17:48:43 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 16
Connection: close
X-Frame-Options: SAMEORIGIN
Referrer-Policy: same-origin
Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Vary: Accept-Encoding
Server: cloudflare
CF-RAY: 7d2a3f81995dcab8-YYZ
I think the cause is Cloudflare's Browser Integrity Check, which is a security setting that can be turned off: https://developers.cloudflare.com/support/firewall/settings/understanding-the-cloudflare-browser-integrity-check/
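A rough sketch of how the check done by eye on the headers above could be automated. The helper name `looks_like_cloudflare_block` is made up for illustration, and the test error is rebuilt by hand from the logged values rather than fetched. It is only a heuristic: a matching `Server` header shows that Cloudflare answered, not that the Browser Integrity Check specifically was the rule that fired.

```python
import io
import urllib.error
from email.message import Message


def looks_like_cloudflare_block(err):
    """Heuristic: a 403 whose Server header names Cloudflare."""
    if err.headers is None:
        return False
    server = err.headers.get("Server") or ""
    return err.code == 403 and server.lower() == "cloudflare"


# Rebuild an error resembling the one logged above, without any network call.
headers = Message()
headers["Server"] = "cloudflare"
headers["CF-RAY"] = "7d2a3f81995dcab8-YYZ"
err = urllib.error.HTTPError(
    "http://purl.obolibrary.org/obo/hp.owl",
    403,
    "Forbidden",
    headers,
    io.BytesIO(b""),
)
print(looks_like_cloudflare_block(err))
```

In real use, the same function could be called on the exception caught from `urllib.request.urlopen`.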
@jamesaoverton I believe that I've turned off BIC for this domain (Cloudflare docs are apparently wildly out of date and not great to begin with).
Thanks @kltm! I'm now getting a 200 response from the first test code posted above -- no more error.
@eliasweatherfield Can you confirm that this is now working for you?
Yes, it's working again. Thanks everyone.
Thanks for the report!
Please provide example PURLs and code so we can try to replicate.