ArchiveTeam / ludios_wpull

wpull fork with fixes and faster parsing using html5-parser; used by grab-site; should go away when wpull is similarly improved
GNU General Public License v3.0

cookies.txt is not properly used during crawls #21

Open systwi-again opened 2 years ago

systwi-again commented 2 years ago

What I wanted/expected: Cookies, read from the provided cookies.txt, to be used during crawls with wpull.

What happened: wpull ignores the provided cookies.txt file and crawls without it.

The command or website that causes the problem: --load-cookies=/absolute/path/to/cookies.txt

Operating system: Debian GNU/Linux 11 (x86_64)

Python version: 3.8.13

Wpull version: 3.0.9

Options used with wpull (obtained using grab-site's --which-wpull-args-partial):

-U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
--header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
--header 'Accept-Language: en-US,en;q=0.5'
--no-check-certificate
--no-robots
--inet4-only
--dns-timeout 20
--connect-timeout 20
--read-timeout 900
--session-timeout 172800
--tries 3
--waitretry 5
--max-redirect 8
--output-file wpull.log
--database wpull.db
--save-cookies cookies.txt
--delete-after
--page-requisites
--concurrent 2
--warc-file example.com-2022-05-12-099f53ca
--warc-max-size 5368709120
--warc-cdx
--strip-session-id
--escaped-fragment
--level inf
--page-requisites-level 5
--span-hosts-allow page-requisites,linked-pages
--debug-manhole
--sitemaps
--load-cookies=/absolute/path/to/cookies.txt
--keep-session-cookies
https://example.com/

Further details and temporary workaround here.

Even after giving cookies.txt 777 permissions, wpull still does not use the cookies in cookies.txt during crawls.

The filesystem used for everything is ext4, has no I/O errors, has ample free space, passes fsck.ext4, and the absolute path contains no spaces or special characters of any kind (just lowercase a-z).

cookies.txt was exported using version 0.3 of this Firefox extension under Firefox 78.15.0esr on the same OS, and was not modified after exporting.

TheTechRobo commented 2 years ago

I can't test this (busy), sorry.

To potentially narrow down when this bug happens, could you go to http://thetechrobo.ca:1111, verify it says that the cookie isn't set, then go to http://thetechrobo.ca:1111/set to set the cookie? Then export the cookies, run wpull on http://thetechrobo.ca:1111 with the cookies, and see if it says the cookies are set.

systwi-again commented 2 years ago

Hmm, oddly enough your site did work with --load-cookies. Can't explain why it's an outlier...

TheTechRobo commented 2 years ago

Again, I've had --load-cookies work before, like with Planet French, but not with Infos-Ados.

Are you sure there aren't any #HttpOnly lines in the cookies.txt...?
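For anyone wanting to check this quickly: some exporters mark HttpOnly cookies by prefixing the line with `#HttpOnly_`, and a parser that treats every line starting with `#` as a comment will skip those entries. Whether wpull does so is not confirmed in this thread. The sketch below is a standalone diagnostic, not part of wpull; the function names `find_httponly_lines` and `strip_httponly_prefix` are made up for illustration, and it assumes the usual tab-separated Netscape cookies.txt layout with the domain in the first field.

```python
from typing import List, Tuple

# Prefix that curl-style cookie files use to flag HttpOnly cookies.
HTTPONLY_PREFIX = "#HttpOnly_"


def find_httponly_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, domain) for every line carrying the prefix."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(HTTPONLY_PREFIX):
            # The domain is the first tab-separated field after the prefix.
            fields = line[len(HTTPONLY_PREFIX):].split("\t")
            hits.append((number, fields[0]))
    return hits


def strip_httponly_prefix(text: str) -> str:
    """Return the text with the prefix removed, so that the affected
    lines read as ordinary cookie entries. Other lines are unchanged."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(HTTPONLY_PREFIX):
            line = line[len(HTTPONLY_PREFIX):]
        out.append(line)
    return "".join(out)
```

To use it, read the exported file into a string, call `find_httponly_lines` to see which domains are affected, and write the result of `strip_httponly_prefix` to a copy of the file (keeping the original) before passing that copy to `--load-cookies`.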

systwi-again commented 2 years ago

Okay, I thought maybe it was an issue with my particular setup for some reason.

Regarding #HttpOnly lines, there is but one instance. It's my school's proprietary web portal that I'm trying to save. I can save it only using the aforementioned workaround, which doesn't send any #HttpOnly cookies anyway, so I take it that that cookie is not as important. ¯\_(ツ)_/¯ I don't know.

TheTechRobo commented 2 years ago

What was that workaround? I can't find it. Never mind, found it.