Open TheTechRobo opened 3 years ago
Update: Actually no, maybe my session doesn't expire??
In any case, I still can't stay signed in by adding the cookies, even after adding my user-agent.
thuban and I on #archiveteam-bs were troubleshooting this very issue a while ago. It seems to be an issue with `wpull`, where it does not import the `cookies.txt` file as instructed.
There is a workaround for the time being, however (steps written assuming you're using Firefox):
Look for `-H 'Cookie: ......`. If you don't see this, try choosing a different entry in the list.
Trim the `curl` query, keeping only the entire cookie entry (`Cookie: ......`).
Write the `grab-site` query in the text file like the example below:
```bash
#!/bin/bash
~/gs-venv/bin/grab-site --1 --wpull-args='--header '"'"'Cookie: SESSIONID=848a0415-98c0-45fc-b281-b805e470b714; EXPIRE=1652325000'"'"' --keep-session-cookies' 'https://auth.example.com/home.html'
```
10. Save the text file, `chmod +x` it and run it. The page should then be saved using the provided cookies.
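If you'd rather not trim the `curl` query by hand, the steps above could be sketched roughly like this. This is only a sketch: the function names `extract_cookie_header` and `build_wpull_args` and the file name `copied-curl.txt` are made up for illustration, and it assumes the copied command has a single-quoted `-H 'Cookie: ...'` entry with no single quotes inside the cookie value.

```bash
#!/bin/bash
# Sketch only: function names and file names here are hypothetical.

# Read a "Copy as cURL" command on stdin and print its first Cookie header,
# e.g. "Cookie: SESSIONID=...; EXPIRE=...".
# Assumes the header is wrapped in single quotes, as in -H 'Cookie: ......'.
extract_cookie_header() {
    grep -io -- "-H 'Cookie: [^']*'" | head -n 1 | sed "s/^-H '//; s/'\$//"
}

# Wrap a Cookie header in the string passed to --wpull-args.
# Assumes the header itself contains no single quotes.
build_wpull_args() {
    printf "%s" "--header '$1' --keep-session-cookies"
}

# Usage (not run here; copied-curl.txt holds the pasted curl query):
#   cookie=$(extract_cookie_header < copied-curl.txt)
#   ~/gs-venv/bin/grab-site --1 --wpull-args="$(build_wpull_args "$cookie")" 'https://auth.example.com/home.html'
```

Building the argument string in a variable sidesteps the `'"'"'` quoting in the hand-written script, but the value that ends up in `--wpull-args` should be the same.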
It's also worth noting that some cookie entries, primarily (or solely?) ones that begin with `#` (such as `#HttpOnly_`), are newer (by Mozilla) and outside the Netscape spec, and thus are not supported by `grab-site`, `wpull` or even `wget` at the time of writing (Unix time 1652325299, i.e. 2022-05-12). If your website happens to require such cookies, your crawl may not work at all. As an extra archival measure, I also export a cookies.txt using [this Firefox extension](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) and move it to the `grab-site` output directory when the crawl is complete. It's better than nothing, I suppose.
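To check whether an exported cookies.txt contains such entries before starting a crawl, something like the following might help. It is a rough sketch with hypothetical function names. It assumes the entries in question are the ones prefixed with `#HttpOnly_`. Whether `wpull` accepts the file once the prefix is stripped is not something I have verified.

```bash
#!/bin/bash
# Sketch only: function names are hypothetical.

# Print how many lines in a cookies.txt file start with "#HttpOnly_".
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_httponly_lines() {
    grep -c '^#HttpOnly_' "$1" || true
}

# Print the file with the "#HttpOnly_" prefix removed, so those entries
# are no longer read as comment lines. The original file is left untouched.
strip_httponly_prefix() {
    sed 's/^#HttpOnly_//' "$1"
}

# Usage (not run here):
#   count_httponly_lines cookies.txt
#   strip_httponly_prefix cookies.txt > cookies-stripped.txt
```

A non-zero count at least tells you up front that the crawl may not stay signed in.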
Cookies work for me most of the time. I've recently crawled Planet French, which requires login. Infos-Ados didn't work, and I don't have the cookies anymore to check whether any of them were #HttpOnly.
Yeah, the #HttpOnly ones gave me headaches in my DeviantArt scraper.
@TheTechRobo wait DeviantArt? Isn't that the job of Gallery-DL or are you trying to get other things from them?
@TomLucidor I wasn't aware of DeviantArt back then.
I'm trying to archive infos-ados.com. After I get the cookies (to stay signed in) and pass them through to grab-site, the WARCs aren't signed in, and my session in the browser expires. How do I fix this?