ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Replace buggy urllib.parse #441

Open JustAnotherArchivist opened 4 years ago

JustAnotherArchivist commented 4 years ago

Python's URL parsing with urllib.parse.urlparse works well for the most common formats, but it quickly breaks down in edge cases. This caused ArchiveBot job 33k8egvaa5dsfxva1s0lsnmv4 to crash with an Invalid IPv6 address error on the URL http://[email=%22info@epic4health.com/. That URL is odd but perfectly parseable per the URL Standard, although it would produce a validation error since credentials are not allowed in valid URLs. There is a list of similar issues at https://bugs.python.org/issue36338#msg355322.
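
For reference, the failure can be reproduced with the standard library alone. The snippet below is only a minimal sketch, not wpull's actual code path: try_parse is a hypothetical helper name, and it assumes the crash comes from the ValueError that urllib.parse raises when the host part contains an unmatched bracket.

```python
from urllib.parse import urlsplit


def try_parse(url):
    """Return the SplitResult, or None if urllib.parse rejects the URL.

    Hypothetical helper for illustration; wpull's real handling may differ.
    """
    try:
        return urlsplit(url)
    except ValueError:
        # urllib.parse treats the unmatched "[" in the authority as a
        # malformed IPv6 literal and raises instead of returning a result.
        return None


# The URL from the crashed ArchiveBot job.
problem_url = "http://[email=%22info@epic4health.com/"

if try_parse(problem_url) is None:
    print("urllib.parse rejected:", problem_url)
```

Catching the exception only avoids the crash. The URL is still dropped rather than parsed the way the URL Standard describes, which is why a different parser is worth considering.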

Because urllib works fine in most cases, there aren't many alternative URL parser packages. A promising candidate is whatwg-url, which is an implementation of the URL Standard.