ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Cookies not staying #187

Open TheTechRobo opened 3 years ago

TheTechRobo commented 3 years ago

I'm trying to archive infos-ados.com. After I get the cookies (to stay signed in) and pass them through to grab-site, the WARCs aren't signed in, and my session in the browser expires. How do I fix this?

TheTechRobo commented 3 years ago

Update: Actually no, maybe my session doesn't expire??

In any case, I still can't stay signed in by adding the cookies, even after adding my user-agent.

systwi-again commented 2 years ago

thuban and I on #archiveteam-bs were troubleshooting this very issue a while ago. It seems to be an issue with wpull, where it does not import the cookies.txt file as instructed.
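
For reference, this is roughly the invocation that ought to work but apparently does not import the cookies: a minimal sketch, assuming grab-site's `--wpull-args` passthrough and wpull's wget-style `--load-cookies` option, with a Netscape-format cookies.txt exported from the browser:

    #!/bin/bash
    # Hand the exported cookies.txt to wpull; in practice the cookies
    # reportedly never make it into the requests.
    ~/gs-venv/bin/grab-site --1 \
        --wpull-args='--load-cookies cookies.txt --keep-session-cookies' \
        'https://auth.example.com/home.html'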

There is a workaround for the time being, however (steps written assuming you're using Firefox):

  1. Launch a web browser that supports copying the cURL request of a loaded resource (e.g. Firefox)
  2. Press F12 (or fn+F12 on some keyboards), or click on Tools > Web Developer > Toggle Tools to show the web developer toolbox
  3. In the web developer toolbox, click the "Network" tab
  4. Enter and load the website/webpage you wish to crawl in that browser tab/window (you need to be logged in, of course)
  5. In the filter text box, paste in the same URL and click the "All" type filter button (or "HTML"). If nothing comes up, slowly truncate the end of the URL until something appears in the list
  6. Right click (or Control-click on macOS) on the entry and click Copy > Copy as cURL
  7. Paste the clipboard contents into a new text document and look for `-H 'Cookie: ...'`. If you don't see it, try choosing a different entry in the list
  8. Remove everything else from that cURL command, keeping only the entire cookie entry (`Cookie: ...`)
  9. Craft your grab-site command in the text file like the example below:
    
    #!/bin/bash
    # The '"'"' sequences embed literal single quotes, so --wpull-args receives:
    #   --header 'Cookie: ...' --keep-session-cookies
    ~/gs-venv/bin/grab-site --1 --wpull-args='--header '"'"'Cookie: SESSIONID=848a0415-98c0-45fc-b281-b805e470b714; EXPIRE=1652325000'"'"' --keep-session-cookies' 'https://auth.example.com/home.html'

10. Save the text file, `chmod +x` it and run it. The page should then be saved using the provided cookies. (A scripted alternative to the browser steps above is sketched just below.)
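
If you would rather not copy the header out of the browser by hand, here is a rough, untested sketch of the same idea that builds the `Cookie:` header straight from a Netscape-format cookies.txt export (tab-separated fields: domain, subdomain flag, path, secure, expiry, name, value). The domain and URL are placeholders; the double-quoted `--wpull-args` produces the same final argument string as the quoting trick in step 9:

    #!/bin/bash
    # Collect "name=value" pairs for the target domain into one Cookie header.
    COOKIE_HEADER="$(awk -F'\t' -v d='example.com' '
        NF >= 7 {                              # real cookie lines have 7 tab-separated fields
            sub(/^#HttpOnly_/, "", $1)         # HttpOnly-marked lines are still usable this way
            if ($1 !~ /^#/ && index($1, d))    # skip comments; keep cookies for our domain
                s = s (s ? "; " : "") $6 "=" $7
        }
        END { print s }
    ' cookies.txt)"

    # Same argument string as the example in step 9, just built from the variable.
    ~/gs-venv/bin/grab-site --1 \
        --wpull-args="--header 'Cookie: $COOKIE_HEADER' --keep-session-cookies" \
        'https://auth.example.com/home.html'

Expired or secure-only cookies are not filtered out here, so it may be worth printing `$COOKIE_HEADER` for a sanity check before starting the crawl.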

It's also worth noting that some cookies, primarily (or solely?) ones that begin with `#`, are newer (by Mozilla) and out of the Netscape spec, and thus are not supported by `grab-site`, `wpull` or even `wget` at the time of writing (1652325299). If your website happens to require such cookies, your crawl may not work at all. As an extra archival measure, I also export a cookies.txt using [this Firefox extension](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/) and move it to the `grab-site` output directory when the crawl is complete. It's better than nothing, I suppose.
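
On the `#`-prefixed cookies mentioned above: if you want to try feeding such a file to a loader that treats those lines as comments, one untested hedge is to strip the `#HttpOnly_` prefix first. The HttpOnly marking is lost, but the name/value pairs become parseable:

    # Untested sketch: rewrite #HttpOnly_ lines as ordinary Netscape cookie lines.
    sed 's/^#HttpOnly_//' cookies.txt > cookies-stripped.txt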

TheTechRobo commented 2 years ago

Cookies work for me most of the time; I recently crawled Planet French, which requires login. Infos-Ados didn't work, and I no longer have the cookies to check whether any of them were #HttpOnly ones.

Yeah, the #HttpOnly ones gave me headaches in my DeviantArt scraper.

TomLucidor commented 8 months ago

@TheTechRobo wait, DeviantArt? Isn't that the job of Gallery-DL, or are you trying to get other things from them?

TheTechRobo commented 8 months ago

@TomLucidor I wasn't aware of DeviantArt back then.