ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Duplicate URLs with different Request headers not stored #100

Open atiro opened 7 years ago

atiro commented 7 years ago

This may be inherent in the WARC format, but we have a site that responds to a URL with either JSON or HTML depending on the request type (XmlHTTP or plain HTTP). In the WARC file after a crawl that retrieves both, only one response is stored for that URL, which then causes problems when replaying the site. Is there any way to avoid this?
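A minimal sketch of the server behavior being described, assuming the site varies its response on the X-Requested-With header (a common convention for XmlHTTP requests; the actual header the site keys on is not stated in the issue):

```python
# Hypothetical server: the same URL returns JSON for XMLHttpRequest
# callers and HTML for everyone else. Illustrative only.
def respond(url, headers):
    """Return (content_type, body), varying on one request header."""
    if headers.get("X-Requested-With") == "XMLHttpRequest":
        return ("application/json", '{"page": "front"}')
    return ("text/html", "<html><body>front page</body></html>")

html_type, _ = respond("https://example.org/", {})
ajax_type, _ = respond(
    "https://example.org/", {"X-Requested-With": "XMLHttpRequest"}
)
```

Both fetches target the same URL, so any store keyed on URL alone can keep only one of the two bodies.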

ivan commented 7 years ago

Can you first check the .warc.gz file with zgrep -F URL FILE.warc.gz? You can use -C N to display more context lines. It's plausible that the WARC playback software you're using just doesn't show both responses, so I want to rule that out.
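The check above can also be done programmatically. This sketch builds a tiny stand-in for a .warc.gz in memory (real WARC records carry many more headers than shown) and counts response records for a URL, which is what the suggested zgrep would surface:

```python
import gzip
import io

# A minimal stand-in for a .warc.gz containing two response records for
# the same URL. Illustrative only; not a complete, valid WARC file.
records = (
    "WARC-Type: response\r\nWARC-Target-URI: https://example.org/\r\n\r\n"
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>\r\n\r\n"
    "WARC-Type: response\r\nWARC-Target-URI: https://example.org/\r\n\r\n"
    "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{}\r\n\r\n"
)
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(records.encode())

# Equivalent of `zgrep -F URL FILE.warc.gz`: decompress and count how
# many response records target the URL.
text = gzip.decompress(buf.getvalue()).decode()
hits = text.count("WARC-Target-URI: https://example.org/")
print(hits)
```

Two hits would mean both responses were captured and the problem lies in playback; one hit would point at the crawl itself.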

ivan commented 7 years ago

And do you know if the crawl actually made requests that would result in those JSON/XmlHTTP responses? grab-site/wpull isn't going to execute JavaScript (unless used with the phantomjs mode that I haven't even documented because it's so unreliable.)

atiro commented 7 years ago

Cheers for the speedy reply. Only one entry for the URL is recorded in the .cdx. Yes, if I grab only the front page (with the XmlHTTP request), it displays fine.
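The .cdx observation can be reproduced with a quick count of index lines per URL key. A sketch, assuming a space-separated CDX layout whose first field is the canonicalized URL key (field layouts vary between CDX versions):

```python
from collections import Counter

# Illustrative CDX lines; the field values here are made up.
cdx = """\
org,example)/ 20170427094225 https://example.org/ text/html 200 AAAA 1234
org,example)/about 20170427094230 https://example.org/about text/html 200 BBBB 2345
"""

# Count index entries per URL key. A count of 1 for a URL that was
# fetched twice means only one of the responses made it into the index.
counts = Counter(
    line.split()[0] for line in cdx.splitlines() if line.strip()
)
print(counts["org,example)/"])
```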

The Internet Archive has the same problem; see http://web.archive.org/web/20170427094225/https://www.vam.ac.uk/ (the grey block obscuring the page is the wrong cached response).

ivan commented 7 years ago

Unfortunately, I don't think there's a way to fetch the same URL twice with different request headers in the same crawl: the database in wpull assumes that one successful response is enough. The only workaround I can think of is to run a second crawl and somehow force it not to make a certain class of requests.