ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

URL containing & character getting split #478

Closed kashortiexda closed 10 months ago

kashortiexda commented 1 year ago

Trying to crawl a site with URLs like https://xyz/abc_&def/ghi_&jkl.html

I realise it is a badly formed site, but I can't change that. The URL gets 'split' into xyz/abc and then a second portion, def/ghi jkl...

https://www.krugerpark.co.za/Kruger_National_Park_Lodging_&_Camping_Guide-Travel/Kruger_National_Park_Lodging_&_Camping_Guide.html

Output in the terminal: _Camping_Guide-Travel/Kruger_National_ParkLodging _Camping_Guide.html

Linux Fedora 38, Python 3.7.16, wpull 3.09

JustAnotherArchivist commented 1 year ago

I suspect you are not quoting the URL correctly, and it's your shell which 'splits' there (because & has a special meaning to shells). wpull should not have any problems with it. Try wrapping the URL in quotes.
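For illustration, a minimal sketch of what the shell does with an unquoted & (example.com stands in for the real site):

# Unquoted: the shell backgrounds 'wpull https://example.com/a_' and then
# tries to run 'b.html' as a separate command, so wpull sees a truncated URL
wpull https://example.com/a_&b.html

# Quoted: the full URL reaches wpull intact
wpull "https://example.com/a_&b.html"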

(Also, wpull 3.x is not this repo but https://github.com/ArchiveTeam/ludios_wpull.)

kashortiexda commented 1 year ago

@JustAnotherArchivist Thanks vm. If I wrap the first URL, do subsequent crawled URLs also get wrapped? (I doubt it.) Unfortunately, all subsequent URLs also have the &. I just checked: my terminal is Unicode UTF-8, but in the output after running wpull ..... I saw 404 Not Found. Length: 283 [text/html; charset=iso-8859-1].

TheTechRobo commented 1 year ago

Wrap all subsequent URLs in quotes, like you would for any other program:

wpull "https://example.com?foo=bar&baz=whatevercomesnext" "https://example.com?baz=bar&foo=d"