ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

HTML in JavaScript leads to undecoded character references in URLs #460

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago

When wpull encounters HTML inside JavaScript strings (or a JSON API), it does not decode character references in the extracted URLs because it does not treat HTML in JS strings specially at all. This causes frequent &amp; appearances in URLs. Further, if a numeric character reference (&#nnn;) is involved, part of the URL is dropped entirely on parsing, as everything after the hash is treated as the fragment (seen in ArchiveBot job 51nt0cax16fen2l8kv14kraon).
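To illustrate both symptoms, here is a minimal sketch using only the standard library's urlsplit rather than wpull's own URL parsing (which may differ in detail). The example.com URLs are made up for illustration and are not taken from the ArchiveBot job.

```python
from urllib.parse import urlsplit

# Hypothetical URLs as they might be extracted from HTML embedded in a JS string.
raw_named = "http://example.com/page?a=1&amp;b=2"
raw_numeric = "http://example.com/page?a=1&#38;b=2"

named = urlsplit(raw_named)
numeric = urlsplit(raw_numeric)

# The named reference survives undecoded in the query string.
print(named.query)       # a=1&amp;b=2

# The numeric reference contains '#', so the rest of the URL becomes the fragment.
print(numeric.query)     # a=1&
print(numeric.fragment)  # 38;b=2
```

Since fragments are not sent to the server, the b=2 parameter in the second URL is effectively lost.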

I'm not sure what the best strategy here is. Trying to detect whether a JS string contains HTML is probably expensive and may not be worth it. Attempting to decode char refs in JS-extracted URLs may be worth exploring though.
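As a rough sketch of the second idea, the snippet below decodes character references in an already extracted URL before it is parsed. The helper name decode_char_refs is hypothetical and not part of wpull. It only touches references terminated by a semicolon, on the assumption that this is the safer choice: html.unescape on its own also decodes some legacy names without a semicolon, so a legitimate query parameter such as &copy=1 would be altered.

```python
import html
import re
from urllib.parse import urlsplit

# Matches only semicolon-terminated references: &name; &#123; &#x7b;
_CHAR_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_char_refs(url):
    """Hypothetical helper: decode HTML character references in a JS-extracted URL."""
    # Unknown names such as &foo; are left unchanged by html.unescape.
    return _CHAR_REF.sub(lambda m: html.unescape(m.group(0)), url)


fixed = decode_char_refs("http://example.com/page?a=1&#38;b=2")
print(fixed)                   # http://example.com/page?a=1&b=2
print(urlsplit(fixed).query)   # a=1&b=2

# A parameter that merely looks like a legacy entity name is left alone.
print(decode_char_refs("http://example.com/page?a=1&copy=2"))
# http://example.com/page?a=1&copy=2
```

This does not attempt to detect whether the JS string actually contained HTML, so a URL that legitimately contains a literal sequence like &amp; would also be rewritten.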