buildinspace / peru

a generic package manager, for including other people's code in your projects
MIT License

Cannot download package possibly because of urllib defaults #218

Closed by colindean 2 years ago

colindean commented 2 years ago

I'm trying to import Font Awesome into a project using peru 1.3.0, with this config:

imports:
  fontawesome: deps/font-awesome

curl module fontawesome:
  url: https://use.fontawesome.com/releases/v5.15.4/fontawesome-free-5.15.4-web.zip
  unpack: zip

But it errors:

$ peru sync
In target "fontawesome":
  Error fetching https://use.fontawesome.com/releases/v5.15.4/fontawesome-free-5.15.4-web.zip
  HTTP Error 403: Forbidden

I think it's a urllib problem:

from urllib.error import HTTPError
import urllib.request

url = "https://use.fontawesome.com/releases/v5.15.4/fontawesome-free-5.15.4-web.zip"

# urllib: this request fails with HTTP 403, and the except block prints
# the response headers.
try:
    resp = urllib.request.urlopen(url)
except HTTPError as e:
    print(e.headers)

# urllib3: the same URL succeeds.
import urllib3
http = urllib3.PoolManager()
resp = http.request("GET", url)
print(resp.status)

The urllib request fails but the urllib3 request succeeds. I don't think urllib's HTTPError gives you access to the request headers (only the response headers), so I can't really tell what's wrong here. I'm able to retrieve the file with curl, wget, aria2c, and Python via urllib3.

Running peru --verbose sync gave me a stack trace that only confirms "yeah, that's a 403", and the headers I get back in the response don't really say what's wrong. I assume it's some kind of bot protection on Cloudflare…?
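For anyone who wants to see what urllib would send, the headers it adds by default can be inspected without making a request. This is only a sketch: the helper name `default_opener_headers` is made up for illustration, and it shows the client side only, not what the remote host does with the header.

```python
import urllib.request


def default_opener_headers():
    """Return the headers a fresh urllib opener adds to every request."""
    # build_opener() returns an OpenerDirector; its addheaders attribute is
    # a list of (name, value) pairs applied unless the Request overrides them.
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)


print(default_opener_headers())
```

On a stock CPython install this reports a single `User-agent` entry of the form `Python-urllib/<major>.<minor>`.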

colindean commented 2 years ago

Workaround: get the FontAwesome package from GitHub instead.

https://github.com/FortAwesome/Font-Awesome/releases/download/5.15.4/fontawesome-free-5.15.4-web.zip

colindean commented 2 years ago

I think it's still worth adding some more debugging information here, or at least figuring out whether the dependency on urllib is problematic when retrieving from Cloudflare-backed URLs.

oconnor663 commented 2 years ago

Interesting, thanks for the detailed report. I've played with it a little bit, and it seems like the key detail is that the User-Agent header has to be set. Here's a minimized repro:

from urllib.request import Request, urlopen

url = "https://use.fontawesome.com/releases/v5.15.4/fontawesome-free-5.15.4-web.zip"
req = Request(url)
# Even with an empty string as the User-Agent header, the request succeeds. But
# if we remove this line, the request fails with error 403.
req.add_header("user-agent", "")
urlopen(req)

@colindean can you confirm this behavior?

I guess it's reasonable that Peru should set something for the UA. Any proposals for what that should be? @olson-sean-k?
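The behavior in the repro can be reproduced offline with a local stand-in server. This is a sketch under a loud assumption: `PickyHandler` is a hypothetical handler that rejects any User-Agent starting with `Python-urllib/`, which is a guess at what the real host does and is not confirmed anywhere in this thread. `fetch_status` and `run_demo` are likewise illustrative names.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the remote host (assumed behavior)."""

    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        # Assumption: the host blocks urllib's default User-Agent.
        status = 403 if user_agent.startswith("Python-urllib/") else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(url, user_agent=None):
    """Return the HTTP status for url, optionally overriding the User-Agent."""
    request = urllib.request.Request(url)
    if user_agent is not None:
        # A header set on the Request replaces the opener's default,
        # even when the value is an empty string.
        request.add_header("User-Agent", user_agent)
    # An empty ProxyHandler avoids routing the loopback request via a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request) as response:
            return response.status
    except HTTPError as error:
        return error.code


def run_demo():
    """Fetch with the default UA, an empty UA, and a custom UA."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        return (
            fetch_status(url),
            fetch_status(url, ""),
            fetch_status(url, "peru/1.3.0"),
        )
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Under that assumption, only the request with the untouched default header is rejected, which matches what the minimized repro shows against the real URL.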

colindean commented 2 years ago

Confirmed.

I'd suggest this:

urllib3 uses python-urllib3/{__version__} as its UA, so perhaps it's appropriate for Peru to use

f"peru/{peru_version} python-urllib/{urllib.request.__version__}"

If you look at the source for urllib.request, there seem to be other ways to create requests that already carry the Python default UA, but I think the above meets the minimum requirement.
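A minimal sketch of that suggestion, assuming a `peru_version` string is available from somewhere in peru. The helper names `build_user_agent` and `make_request` are hypothetical, and this is not necessarily how #219 implements it.

```python
import urllib.request


def build_user_agent(peru_version):
    """Combine a peru identifier with urllib's own version string."""
    # urllib.request.__version__ is "<major>.<minor>" of the running Python.
    return f"peru/{peru_version} python-urllib/{urllib.request.__version__}"


def make_request(url, peru_version):
    """Build a Request that carries the combined User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", build_user_agent(peru_version))
    return request


if __name__ == "__main__":
    req = make_request("https://example.com/archive.zip", "1.3.0")
    # Request stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Building the Request does not touch the network, so the header can be checked before anything is fetched.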

colindean commented 2 years ago

I put up #219 with a stab at a quick fix. It adds the header, at least. I haven't actually tested it with my config example above yet, though.

colindean commented 2 years ago

I tested #219 with the example config and it works! It will fix the problem.

olson-sean-k commented 2 years ago

Thanks for finding (and fixing) this, @colindean!

> I guess it's reasonable that Peru should set something for the UA. Any proposals for what that should be?

> urllib3 uses python-urllib3/{__version__} as its UA, so perhaps it's appropriate for Peru to use ...

I'm not too sure, but I think there are three reasonable options:

  1. Whatever Python and urllib use by default (e.g., Python-urllib/{__version__}).
  2. Something that identifies peru (e.g., peru/{__version__}).
  3. A combination of the above as @colindean suggested.

FWIW, I think I'd lean a bit toward the first option, but I don't have a strong opinion about it.

colindean commented 2 years ago

#219 goes with option 3 in that list and pulls the urllib UA directly from a core class that won't go away. IMO, urllib doesn't sufficiently abstract this value 👎