is it possible to output regular files instead of warc?

ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

is it possible to output regular files instead of warc? #228

Open ftc2 opened 1 year ago

ftc2 commented 1 year ago

i only want files, not warc.

can grab-site output regular files (like html and images) for me like wget can? (links must be converted to relative links)

side question: has anyone here actually had good results with getting files back out of warc? this wouldn't be such a big deal if that were possible. i've never seen a util that can extract files from warcs with 100% success rate (and it's usually insanely slow).

i've tried:

jwat-tools: seemed the best coded of the bunch but gave me nonsensical filenames like extracted.001, and idk how to get past that
warcat: slow and fails on many warcs
warc-extractor: the easiest to use of the bunch (it can hit a bunch of warcs in a single dir), but it's insanely slow, and it also fails on many warcs
the unarchiver: fails on some warcs

TheTechRobo commented 1 year ago

May be of interest to you: https://replayweb.page/ can load WARCs and allow you to browse them. It works best on websites that don't heavily rely on JavaScript.

I'd suggest using wpull on its own (grab-site is basically wpull but tuned for easier crawling) but the current state of wpull outside of wrappers like this is awful. :/

ftc2 commented 1 year ago

thanks. i'm familiar with replayweb, but warc is really not for me.

i want the option to be able to do things like:

host the archive as static content on nginx
iterate over files to scrape content with certain tools

it's just easier for me to work with files.
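
To illustrate the "iterate over files" workflow mentioned above, here is a minimal Python sketch that walks a directory of mirrored HTML files and collects each page's title. The function name `page_titles`, the `TitleParser` helper, and the idea of scraping titles specifically are all assumptions made for illustration; the thread does not say which tools or fields are involved.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element(s) of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_titles(root):
    """Map each *.html file under root (hypothetical mirror dir) to its title."""
    root = Path(root)
    titles = {}
    for path in sorted(root.rglob("*.html")):
        parser = TitleParser()
        # errors="replace" keeps the loop going on pages with odd encodings.
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        titles[path.relative_to(root).as_posix()] = parser.title.strip()
    return titles
```

Calling `page_titles("mirror/")` on a directory of plain files returns a dict keyed by relative path, which is the kind of per-file processing that is awkward to do directly against a WARC.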

tbh, i would just use wget, but i'm having problems with it staying logged in even when using the various cookie options. sigh

i've tried:

--load-cookies exported_from_firefox.txt --keep-session-cookies
--load-cookies exported_from_firefox.txt --keep-session-cookies --save-cookies exported_from_firefox.txt

neither works. any tips?

it's very frustrating because i've had luck using curl with the same cookie file like this:

--cookie exported_from_firefox.txt --cookie-jar exported_from_firefox.txt

but curl has no crawling functionality.

TheTechRobo commented 1 year ago

Does grab-site work with the cookie issue?

Go into the exported_from_firefox.txt file and check for any #HttpOnly lines. Those are a common problem with cookies.txt parsers as they aren't part of any official specification. I've had luck occasionally with removing the #HttpOnly from the beginning of the line (don't do that for the dot though, I don't think) but your mileage may vary.
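
The tip above can be sketched in Python roughly as follows. It assumes the export uses the curl-style `#HttpOnly_` prefix (including the trailing underscore) in front of the domain field, so stripping it leaves the leading dot of the domain intact. The function names and the output filename in the usage note are hypothetical.

```python
HTTPONLY_PREFIX = "#HttpOnly_"  # assumed prefix; check your own cookies.txt


def strip_httponly_prefix(text):
    """Return cookies.txt content with the #HttpOnly_ prefix removed.

    Ordinary comment lines (e.g. "# Netscape HTTP Cookie File") are left
    untouched, and the domain's leading dot is preserved.
    """
    fixed = []
    for line in text.splitlines(keepends=True):
        if line.startswith(HTTPONLY_PREFIX):
            line = line[len(HTTPONLY_PREFIX):]
        fixed.append(line)
    return "".join(fixed)


def fix_cookie_file(src, dst):
    """Read src, strip the prefix, and write the result to dst."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(strip_httponly_prefix(text))
```

For example, `fix_cookie_file("exported_from_firefox.txt", "cookies_fixed.txt")` writes a cleaned copy and leaves the original export alone, which makes it easy to compare the two if the login still fails.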

ftc2 commented 1 year ago

i was so super frustrated with trying to extract files from old WARCs from another project that i didn't even bother trying grab-site without first determining that it could save plain files, haha. that's kind of a prerequisite for me now.

httrack is starting to look like one of the only candidates at this point.

i'll look into your cookie tips and see if i can get wget working first though since i'm already pretty familiar with wget.

ftc2 commented 1 year ago

at first glance, i think your #HttpOnly tip fixed it for me. i'll stick with wget for now until i need something more complex. many thanks.
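
For reference, a wget run that saves plain files with converted links and reuses the cookie file might be assembled like this. This is only a sketch: the thread does not say which crawl flags were actually used, so the combination below (beyond `--load-cookies` and `--keep-session-cookies`, which appear above) is an assumption, as are the function names, URL, and paths.

```python
import subprocess


def build_wget_mirror_command(url, cookies_file, output_dir):
    """Build an argv list for a logged-in wget mirror (assumed flag set)."""
    return [
        "wget",
        "--mirror",              # recursive download with timestamping
        "--convert-links",       # rewrite links so the copy works offline
        "--adjust-extension",    # add .html etc. where the URL lacks it
        "--page-requisites",     # fetch images, CSS, and other page assets
        "--no-parent",           # stay below the starting directory
        "--load-cookies", cookies_file,
        "--keep-session-cookies",
        "--directory-prefix", output_dir,
        url,
    ]


def run_mirror(url, cookies_file, output_dir):
    """Run the mirror; raises CalledProcessError if wget exits non-zero."""
    subprocess.run(build_wget_mirror_command(url, cookies_file, output_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.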

TomLucidor commented 1 year ago

@TheTechRobo Seconding this about plain HTML files but for the reason of plugging it into AI document parsers like Khoj or GPT4All, summarizing blogs and making personal assistants out of it is kinda lit.