purl.obolibrary.org URLs return 403 when using urllib.request (Python)

OBOFoundry / purl.obolibrary.org

A system for managing OBO PURLs

BSD 3-Clause "New" or "Revised" License

purl.obolibrary.org URLs return 403 when using urllib.request (Python) #923

Status: Closed (nklsbckmnn closed this issue 1 year ago)

jamesaoverton commented 1 year ago

Please provide example PURLs and code so we can try to replicate.

nklsbckmnn commented 1 year ago

import urllib.request

url = "http://purl.obolibrary.org/obo/hp.owl"

# At the time of this report, this call raised "HTTP Error 403: Forbidden"
response = urllib.request.urlopen(url)

matentzn commented 1 year ago

Is this behaviour new?

This works:

from urllib.request import Request, urlopen

url = "https://purl.obolibrary.org/obo/hp.owl"
# Send a browser-like User-Agent instead of urllib's default one
request_site = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(request_site)

But I am wondering whether the recent changes to the PURL system now cause the error described here:

https://www.pythonpool.com/urllib-error-httperror-http-error-403-forbidden

The wheregoes trace works:

https://wheregoes.com/trace/20232599746/

nklsbckmnn commented 1 year ago

Yes, it's new. I think it still worked on Friday. Maybe GitHub is blocking user agents it suspects of abuse?

nklsbckmnn commented 1 year ago

However, requesting https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl or https://github.com/obophenotype/human-phenotype-ontology/releases/download/v2023-04-05/hp.owl directly works.

matentzn commented 1 year ago

On Friday we changed something in our PURL config (cc @kltm), but it is a bit odd that every tool other than urllib.request works: wget, curl, and wheregoes.

Thanks for the report!

jamesaoverton commented 1 year ago

EDIT: Posted too soon, the following is incorrect.

~I think the problem is GitHub, not the PURL server. The PURL server redirects http://purl.obolibrary.org/obo/hp.owl to https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl ( https://github.com/OBOFoundry/purl.obolibrary.org/blob/master/config/hp.yml#LL9C11-L9C99). This code gives me a 403:~

import urllib.request
url = "https://github.com/obophenotype/human-phenotype-ontology/releases/latest/download/hp.owl" 
response = urllib.request.urlopen(url)

matentzn commented 1 year ago

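The requests library also works:

import requests

# allow_redirects=True is the default for GET; it is spelled out here for clarity
r = requests.get('http://purl.obolibrary.org/obo/omo.owl', allow_redirects=True)
# Use a context manager so the output file is closed after writing
with open('omo.owl', 'wb') as f:
    f.write(r.content)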

matentzn commented 1 year ago

Using @eliasweatherfield's code in a try/except block also works:

import urllib.request

url = "http://purl.obolibrary.org/obo/omo.owl"

try:
    response = urllib.request.urlopen(url)
except:
    print("Ignore this error")

print(response.read(100))

This suggests that the request is successful, but the error is thrown regardless.

jamesaoverton commented 1 year ago

I can confirm the 403 described by @eliasweatherfield in Python 3.9 and 3.11. I think @matentzn is seeing an old response object, because I get a NameError: name 'response' is not defined error from the final line print(response.read(100)).
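A minimal offline sketch of why that final line fails once the request is rejected: if `urlopen` raises, the assignment never happens, so `response` stays unbound unless an earlier run in the same session already defined it. The `fake_urlopen` and `fetch_prefix` names are hypothetical stand-ins for illustration; `fake_urlopen` always raises a 403 so no network access is needed. Inside a function the failure surfaces as `UnboundLocalError`, which is a subclass of the `NameError` seen at module level.

```python
import urllib.error


def fake_urlopen(url):
    # Hypothetical stand-in for urllib.request.urlopen that always fails with a 403.
    raise urllib.error.HTTPError(url, 403, "Forbidden", None, None)


def fetch_prefix(url):
    try:
        response = fake_urlopen(url)  # raises, so `response` is never bound
    except urllib.error.HTTPError:
        pass  # swallowing the error hides the 403 from the caller
    try:
        return response.read(100)
    except NameError as err:  # UnboundLocalError is a subclass of NameError
        return f"{type(err).__name__}: {err}"


result = fetch_prefix("http://purl.obolibrary.org/obo/omo.owl")
print(result)
```

Catching the specific exception and printing it, rather than using a bare `except`, makes the 403 visible instead of masking it.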

jamesaoverton commented 1 year ago

Ok, now I think that Cloudflare is rejecting the request, which makes sense given the timing of this issue:

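import urllib.error  # imported explicitly rather than relying on urllib.request to load it
import urllib.request

url = "http://purl.obolibrary.org/obo/hp.owl"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # Print the status and response headers to see which server rejected the request
    print(e)
    print(e.code)
    print(e.reason)
    print(e.headers)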

HTTP Error 403: Forbidden
403
Forbidden
Date: Mon, 05 Jun 2023 17:48:43 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 16
Connection: close
X-Frame-Options: SAMEORIGIN
Referrer-Policy: same-origin
Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Vary: Accept-Encoding
Server: cloudflare
CF-RAY: 7d2a3f81995dcab8-YYZ
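The `Server: cloudflare` and `CF-RAY` headers are what point at Cloudflare rather than GitHub. As a rough sketch, the same check can be wrapped in a small helper; `describe_http_error` is a hypothetical name, and the example builds an `HTTPError` by hand from the headers shown above so it runs without network access.

```python
import urllib.error
from email.message import Message


def describe_http_error(err):
    """Summarise an HTTPError, including which server produced it."""
    headers = err.headers
    server = headers.get("Server", "unknown") if headers is not None else "unknown"
    return f"{err.code} {err.reason} (Server: {server})"


# Offline demonstration using header values copied from the output above.
headers = Message()
headers["Server"] = "cloudflare"
headers["CF-RAY"] = "7d2a3f81995dcab8-YYZ"
example = urllib.error.HTTPError(
    "http://purl.obolibrary.org/obo/hp.owl", 403, "Forbidden", headers, None
)
summary = describe_http_error(example)
print(summary)  # 403 Forbidden (Server: cloudflare)
```

In real use, the helper would be called from the `except urllib.error.HTTPError as e:` branch with the caught exception.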

jamesaoverton commented 1 year ago

I think the cause is Cloudflare's Browser Integrity Check, which is a security setting that can be turned off: https://developers.cloudflare.com/support/firewall/settings/understanding-the-cloudflare-browser-integrity-check/
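For context, a small sketch of what `urllib.request` sends by default. The thread does not confirm exactly which header the check objected to, but the default `Python-urllib/<version>` User-Agent is a plausible trigger, which would be consistent with the `Mozilla/5.0` workaround posted earlier in this thread. The `build_request` helper and the `my-ontology-fetcher/0.1` string are hypothetical, for illustration only.

```python
import urllib.request

# The default opener identifies itself as "Python-urllib/<version>".
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]
print(default_user_agent)


def build_request(url, user_agent):
    # Hypothetical helper: attach a caller-supplied User-Agent to the request.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://purl.obolibrary.org/obo/hp.owl", "my-ontology-fetcher/0.1")
# Request normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would then send the custom header; nothing in this sketch touches the network.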

kltm commented 1 year ago

@jamesaoverton I believe that I've turned off BIC for this domain (Cloudflare docs are apparently wildly out of date and not great to begin with).

jamesaoverton commented 1 year ago

Thanks @kltm! I'm now getting a 200 response from the first test code posted above -- no more error.

@eliasweatherfield Can you confirm that this is now working for you?

nklsbckmnn commented 1 year ago

Yes, it's working again. Thanks everyone.

jamesaoverton commented 1 year ago

Thanks for the report!