ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

grab-site grabs urls with session id against my will #132

Closed fin-atem closed 5 years ago

fin-atem commented 5 years ago

--grab-site 1.8.0 under macOS High Sierra 10.13.6 with python 3.4--

attempting to run:

$ grab-site --1 "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20"

stalls for a long time near the end of the grab. it eventually completes after a few minutes with an image failing to be captured. (this one)

specifying the user agent:

$ grab-site --1 --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20"

allows the crawl to complete in roughly 20 seconds with that image downloaded just fine.

my assumption is that the image issue probably has to do with the website and has nothing to do with grab-site.

the issue? try running that .warc (either one) through the mac version of webarchive player 1.0.9.311 (that's what i did). click on the first forum page (https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513), then follow the link to the second page (the page number is near the top of the page). look at the url bar. i see a session id ☹️

ivan commented 5 years ago

Ah, earlier I thought grab-site/wpull was actually grabbing URLs with the session id. But it looks like you're following a link in a WARC player on a page that included URLs with a ?sid=. grab-site/wpull won't rewrite the responses themselves to strip the session IDs; that would be fabricating responses that it did not actually observe.

Two things that might help:

Start the crawl on another page; it looks like the forum software stops including ?sid= in the links if a cookie is set.

Add session ID stripping support to the WARC player (perhaps by checking for matching URLs that don't include the ?sid= parameter).
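
A rough sketch of what the second suggestion could look like on the player side, in Python. This is not code from webarchive player, grab-site, or wpull; the function names `strip_session_id` and `resolve_in_archive`, the `captured_urls` set, and the `sid=abc123` value are all hypothetical and only illustrate the fallback lookup. It assumes the session ID travels in a query parameter named `sid`, as in the URLs above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="sid"):
    """Return `url` with every `param` query argument removed (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    # urlencode re-encodes the remaining arguments, so unusual escaping in the
    # original query string may come out normalized rather than byte-identical.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def resolve_in_archive(requested_url, captured_urls, param="sid"):
    """Prefer an exact capture; otherwise fall back to the URL without `param`.

    `captured_urls` stands in for whatever index of captured URLs the player
    keeps. Returns None when neither form was captured.
    """
    if requested_url in captured_urls:
        return requested_url
    stripped = strip_session_id(requested_url, param)
    return stripped if stripped in captured_urls else None


if __name__ == "__main__":
    captured = {
        "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10",
    }
    # The sid value here is made up for illustration.
    clicked = (
        "https://www.exxoshost.co.uk/forum/viewtopic.php"
        "?f=13&t=513&start=10&sid=abc123"
    )
    print(resolve_in_archive(clicked, captured))
```

The WARC itself stays untouched with this approach: the recorded responses still contain the ?sid= links exactly as the server sent them, and only the lookup at replay time is relaxed.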

fin-atem commented 5 years ago

my apologies, thank you for the help. closed.