Open ftc2 opened 1 year ago
May be of interest to you: https://replayweb.page/ can load WARCs and allow you to browse them. It works best on websites that don't heavily rely on JavaScript.
I'd suggest using wpull on its own (grab-site is basically wpull but tuned for easier crawling) but the current state of wpull outside of wrappers like this is awful. :/
thanks. i'm familiar with replayweb, but warc is really not for me.
i want the option to be able to do things like:
it's just easier for me to work with files.
tbh, i would just use wget, but i'm having problems with it staying logged in even when using the various cookie options. sigh
i've tried:
--load-cookies exported_from_firefox.txt --keep-session-cookies
--load-cookies exported_from_firefox.txt --keep-session-cookies --save-cookies exported_from_firefox.txt
neither works. any tips?
it's very frustrating because i've had luck using curl with the same cookie file like this:
--cookie exported_from_firefox.txt --cookie-jar exported_from_firefox.txt
but curl has no crawling functionality.
Does grab-site work with the cookie issue?
Go into the exported_from_firefox.txt file and check for any #HttpOnly lines. Those are a common problem with cookies.txt parsers as they aren't part of any official specification. I've had luck occasionally with removing the #HttpOnly from the beginning of the line (don't do that for the dot though, I don't think) but your mileage may vary.
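For anyone who wants to script that cleanup instead of editing by hand, here is a minimal sketch in Python. It assumes the export marks those cookies with the literal prefix `#HttpOnly_` (the form curl and several browser export extensions write); the function name `strip_httponly_prefix` and the output filename in the usage comment are made up for illustration. It only drops the prefix, so a leading dot on the domain stays in place.

```python
from pathlib import Path

# Assumption: HttpOnly cookies are marked with this exact prefix in the export.
HTTPONLY_PREFIX = "#HttpOnly_"


def strip_httponly_prefix(text: str) -> str:
    """Return cookies.txt content with the #HttpOnly_ prefix removed from each line."""
    cleaned = []
    for line in text.splitlines():
        if line.startswith(HTTPONLY_PREFIX):
            # Keep everything after the prefix, including a leading dot on the domain.
            line = line[len(HTTPONLY_PREFIX):]
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Usage (writes a cleaned copy rather than overwriting the export):
#   src = Path("exported_from_firefox.txt")
#   Path("cookies_clean.txt").write_text(strip_httponly_prefix(src.read_text()))
```

Ordinary comment lines such as the `# Netscape HTTP Cookie File` header are left alone, since they don't start with the prefix. Whether this fixes a given login is still a "your mileage may vary" thing.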
i was so super frustrated with trying to extract files from old WARCs from another project that i didn't even bother trying grab-site without first determining that it could save plain files, haha. that's kind of a prerequisite for me now.
httrack is starting to look like one of the only candidates at this point.
i'll look into your cookie tips and see if i can get wget working first though since i'm already pretty familiar with wget.
at first glance, i think your #HttpOnly tip fixed it for me. i'll stick with wget for now until i need something more complex. many thanks.
@TheTechRobo Seconding this about plain HTML files but for the reason of plugging it into AI document parsers like Khoj or GPT4All, summarizing blogs and making personal assistants out of it is kinda lit.
i only want files, not warc.
can grab-site output regular files (like html and images) for me like wget can? (links must be converted to relative links)
side question: has anyone here actually had good results with getting files back out of warc? this wouldn't be such a big deal if that were possible. i've never seen a util that can extract files from warcs with a 100% success rate (and it's usually insanely slow).
i've tried:
extracted.001, and idk how to get past that
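For reference, a rough sketch of what pulling plain files out of a WARC can look like in Python, assuming the third-party `warcio` package is installed (`pip install warcio`). The helper names `url_to_path` and `extract_warc` are made up for this example, and this is not something grab-site does for you. It writes each HTTP 200 response body under `out_dir/<host>/<path>`; it does not rewrite links to relative ones, ignores query strings (so URLs that differ only in the query overwrite each other), and does not handle a URL that is both a file and a directory prefix.

```python
import os
from urllib.parse import unquote, urlsplit


def url_to_path(url: str) -> str:
    """Map a URL to a relative file path like 'example.com/a/b.png'."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    if not path or path.endswith("/"):
        # Directory-style URLs get an index file, similar to what wget does.
        path += "index.html"
    # Drop empty, "." and ".." segments so nothing escapes the output directory.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return "/".join([parts.netloc, *segments])


def extract_warc(warc_path: str, out_dir: str) -> int:
    """Write every HTTP 200 response body in the WARC to out_dir; return the count."""
    # Imported here so url_to_path stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    count = 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != "200":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            target = os.path.join(out_dir, url_to_path(url))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                # content_stream() undoes chunked and gzip transfer encodings.
                f.write(record.content_stream().read())
            count += 1
    return count
```

Something like `extract_warc("site.warc.gz", "extracted")` would be the call, with both names as placeholders. No claim of a 100% success rate here: redirects, revisit records, and dynamically generated URLs still need handling of their own.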