ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Cookies not staying #187

Open TheTechRobo opened 3 years ago

TheTechRobo commented 3 years ago

I'm trying to archive infos-ados.com. After I get the cookies (to stay signed in) and pass them through to grab-site, the WARCs aren't signed in, and my session in the browser expires. How do I fix this?

TheTechRobo commented 3 years ago

Update: Actually no, maybe my session doesn't expire??

In any case, I still can't stay signed in by adding the cookies, even after adding my user-agent.

systwi-again commented 2 years ago

thuban and I on #archiveteam-bs were troubleshooting this very issue a while ago. It seems to be an issue with wpull, where it does not import the cookies.txt file as instructed.
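
For reference, this is roughly the invocation that ought to work but apparently does not import the cookies: a minimal sketch, assuming grab-site's `--wpull-args` passthrough and wpull's wget-style `--load-cookies` option, with a Netscape-format cookies.txt exported from the browser:

    #!/bin/bash
    # Hand the exported cookies.txt to wpull; in practice the cookies
    # reportedly never make it into the requests.
    ~/gs-venv/bin/grab-site --1 \
        --wpull-args='--load-cookies cookies.txt --keep-session-cookies' \
        'https://auth.example.com/home.html'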

There is a workaround for the time being, however (steps written assuming you're using Firefox):

  1. Launch a web browser that supports copying the cURL request of a loaded resource (e.g. Firefox)
  2. Press F12 (or fn+F12 on some keyboards), or click on Tools > Web Developer > Toggle Tools to show the web developer toolbox
  3. In the web developer toolbox, click the "Network" tab
  4. Enter and load the website/webpage you wish to crawl in that browser tab/window (you need to be logged in, of course)
  5. In the filter text box, paste in the same URL and click the "All" type filter button (or "HTML"). If nothing comes up, slowly truncate the end of the URL until something appears in the list
  6. Right click (or Control-click on macOS) on the entry and click Copy > Copy as cURL
  7. Paste the clipboard contents into a new text document and look for `-H 'Cookie: ...'`. If you don't see it, try choosing a different entry in the list
  8. Remove everything else from that cURL command, keeping only the entire cookie entry (`Cookie: ...`)
  9. Craft your grab-site command in the text file like the example below:
    
    #!/bin/bash
    # The '"'"' sequences embed literal single quotes, so --wpull-args receives:
    #   --header 'Cookie: ...' --keep-session-cookies
    ~/gs-venv/bin/grab-site --1 --wpull-args='--header '"'"'Cookie: SESSIONID=848a0415-98c0-45fc-b281-b805e470b714; EXPIRE=1652325000'"'"' --keep-session-cookies' 'https://auth.example.com/home.html'

10. Save the text file, `chmod +x` it and run it. The page should then be saved using the provided cookies. (A scripted alternative to the browser steps above is sketched just below.)
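
If you would rather not copy the header out of the browser by hand, here is a rough, untested sketch of the same idea that builds the `Cookie:` header straight from a Netscape-format cookies.txt export (tab-separated fields: domain, subdomain flag, path, secure, expiry, name, value). The domain and URL are placeholders; the double-quoted `--wpull-args` produces the same final argument string as the quoting trick in step 9:

    #!/bin/bash
    # Collect "name=value" pairs for the target domain into one Cookie header.
    COOKIE_HEADER="$(awk -F'\t' -v d='example.com' '
        NF >= 7 {                              # real cookie lines have 7 tab-separated fields
            sub(/^#HttpOnly_/, "", $1)         # HttpOnly-marked lines are still usable this way
            if ($1 !~ /^#/ && index($1, d))    # skip comments; keep cookies for our domain
                s = s (s ? "; " : "") $6 "=" $7
        }
        END { print s }
    ' cookies.txt)"

    # Same argument string as the example in step 9, just built from the variable.
    ~/gs-venv/bin/grab-site --1 \
        --wpull-args="--header 'Cookie: $COOKIE_HEADER' --keep-session-cookies" \
        'https://auth.example.com/home.html'

Expired or secure-only cookies are not filtered out here, so it may be worth printing `$COOKIE_HEADER` for a sanity check before starting the crawl.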

It's also worth noting that some cookies, primarily (or solely?) ones that begin with `#`, are newer (by Mozilla) and out of the Netscape spec, and thus are not supported by `grab-site`, `wpull` or even `wget` at the time of writing (1652325299). If your website happens to require such cookies, your crawl may not work at all. As an extra archival measure, I also export a cookies.txt using [this Firefox extension](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) and move it to the `grab-site` output directory when the crawl is complete. It's better than nothing, I suppose.
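
On the `#`-prefixed cookies mentioned above: if you want to try feeding such a file to a loader that treats those lines as comments, one untested hedge is to strip the `#HttpOnly_` prefix first. The HttpOnly marking is lost, but the name/value pairs become parseable:

    # Untested sketch: rewrite #HttpOnly_ lines as ordinary Netscape cookie lines.
    sed 's/^#HttpOnly_//' cookies.txt > cookies-stripped.txt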

TheTechRobo commented 2 years ago

Cookies work for me most of the time; I recently crawled Planet French, which requires login. Infos-Ados didn't work, and I no longer have the cookies to check whether any of them were #HttpOnly ones.

Yeah, the #HttpOnly ones gave me headaches in my DeviantArt scraper.

TomLucidor commented 8 months ago

@TheTechRobo wait, DeviantArt? Isn't that the job of Gallery-DL, or are you trying to get other things from them?

TheTechRobo commented 8 months ago

@TomLucidor I wasn't aware of DeviantArt back then.