buren / wayback_archiver

Ruby gem to send URLs to Wayback Machine
https://rubygems.org/gems/wayback_archiver
MIT License

Collect only OK (200, 302) pages #26

Closed bartman081523 closed 5 years ago

bartman081523 commented 5 years ago

Hello Mr. Buren, I propose another pull request; I hope you can approve it.

The archiver currently collects every page, which means it also tries to collect all 404 pages. (Yes, I regret that I had not thoroughly tested the #24 pull request for every eventuality, but now I have.) Switching to spider.every_ok_page ensures the archiver does not collect 404 pages; I have tested it, and it collects 200 (OK) and also 302 (Redirected) pages. Since I use this very extensively, 404s from invalid URLs had stacked up for some sites, e.g. /typo3/.../typo3/.../typo3/, and collecting all those 404s slowed down the actual archiving task of this script or even rendered the archiving of a whole site unsuccessful.
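For illustration, a minimal sketch of the proposed change, assuming the crawl loop is built on Spidr's event API (the urls collection here is hypothetical):

spider.every_ok_page do |page|
  # Spidr only yields pages that pass its "ok" status check here,
  # so 404 responses never reach the archiver
  urls << page.url.to_s
end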

Off Topic Enhancement 1: I also want to work on a command-line switch that makes it possible to archive off-domain content through this script, i.e. to pass off-domain files as single URL targets and hand them to the Internet Archive. I have certain pages that only work cross-domain: what I want to create is an archive, or a collection, of public broadcasting videos from some countries, where the HTML video page runs on one domain while the videos are streamed from another.

Off Topic Enhancement 2: What I would consider thoughtful is a switch to spider the resources of one page first, to ensure that the page runs before the following links are spidered (a rough sketch of what I mean follows below). I will look into the Spidr documentation. I have opened an issue for the Wayback Machine, because the rewriting of domains, the Same-Origin Policy, and CORB are still a problem for it. #15 would be the same approach for Spidr.

Off Topic Enhancement 3: I also have the idea of scraping the pages through headless Chrome, e.g. https://github.com/machinio/cuprite, but never mind, these are only suggestions.
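To make Enhancement 2 concrete, a hypothetical sketch using Spidr's API (every_html_page and page.doc are Spidr features; archive_url is a made-up placeholder for whatever posts a URL to the archive):

require 'spidr'

Spidr.site('https://example.com/') do |spider|
  spider.every_html_page do |page|
    # collect the page's own resources (scripts, images, stylesheets) ...
    resources = page.doc.css('script[src], img[src], link[href]').map do |node|
      node['src'] || node['href']
    end
    # ... and archive them before the spider moves on to the page's links
    resources.each { |resource| archive_url(page.url.merge(resource)) }
  end
end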

buren commented 5 years ago

Hi @chlorophyll-zz, thank you for your PR!

It seems to me that this would only archive pages that return HTTP 200: every_ok_page is implemented in agent/events.rb#L97 and uses page.ok?, implemented here, which checks code == 200.
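Roughly, the relevant checks in Spidr's Page class look like this (paraphrased from memory; the exact implementation may differ):

def ok?
  code == 200
end

def redirect?
  case code
  when 300..303, 307 then true  # HTTP redirect status codes
  when 200 then meta_redirect?  # <meta http-equiv="refresh"> pages
  else false
  end
end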

I think we should post all 200s and all redirects to the archive, so that redirecting URLs are captured too. Something like:

spider.every_page do |page|
  # skip anything that is neither HTTP 200 nor a redirect
  next if !page.ok? && !page.redirect?
  # ...
end

As for "Off Topic Enhancement 1" I've implemented a command line option --hosts that enables you to crawl multiple domains. See https://github.com/buren/wayback_archiver/pull/27

Crawl justarrived.se, app.justarrived.se and careers.justarrived.se:

wayback_archiver https://justarrived.se --hosts=app.justarrived.se,careers.justarrived.se

bartman081523 commented 5 years ago

@buren great, thank you for your effort. Here is an example of what your work is capable of; all videos and scripts are archived permanently in the Internet Archive's Wayback Machine. (We have the problem of de-publication of public broadcasting content. Yes, this completely contradicts the word "public" in public broadcasting, but luckily for us the IA does not have to follow such contradictory industry regulations.)

http://web.archive.org/web/20190404135701/https://www.arte.tv/de/videos/083585-000-A/suechtig-nach-schmerzmitteln/

What I still have to do is disable the Content-Security-Policy with this Chrome extension: https://chrome.google.com/webstore/detail/disable-content-security/ieelmcmcagommplceebfedjlakkhpden/related

I have reported the rewriting issues with CORB, CSP, and relative sources (src=/scripts/abc.js) to the maintainers of the WBM.

To be completely honest, I used a few hacks:

# short alias for the wayback_archiver binary
alias wba=/home/user/.gems/ruby/2.6.0/bin/wayback_archiver

# download the source HTML page
wget www.example.com/video1/index.html

# extract all links from the page into resources.txt
cat index.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort | uniq >> resources.txt

# hand every line of resources.txt to wayback_archiver as a URL archival target for the WBM
cat resources.txt | while read CMD; do wba --url "$CMD"; done
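For what it's worth, a rough Ruby equivalent of this pipeline, assuming the gem exposes WaybackArchiver.archive with a :url strategy (treat the exact API as an assumption and check the README):

require 'wayback_archiver'

# extract unique http(s) URLs from the downloaded page
html = File.read('index.html')
urls = html.scan(%r{https?://[a-zA-Z0-9./?=_-]+}).uniq

# post each URL to the Wayback Machine as a single-URL target
urls.each do |url|
  WaybackArchiver.archive(url, strategy: :url)
end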

It was from that perspective that I suggested Off Topic Enhancements 1 and 2.