fin-atem closed 5 years ago
Ah, earlier I thought grab-site/wpull was actually grabbing URLs with the session id. But it looks like you're following a link in a WARC player on a page that included URLs with a ?sid=. grab-site/wpull won't rewrite the responses themselves to strip the session IDs; that would be fabricating responses that it did not actually observe.
Two things that might help:
Start the crawl on another page; it looks like the forum software stops including ?sid= in the links if a cookie is set.
Add session id stripping support to the WARC player (perhaps by checking for matching URLs that don't include the ?sid=.)
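For the second suggestion, here is a rough sketch of what the lookup could look like in Python. It is not code from any actual WARC player: `strip_sid` and `find_capture` are hypothetical helper names, `captured_urls` stands in for whatever index of archived URLs the player keeps, and the `sid` value in the usage lines is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url, param="sid"):
    """Return url with the session-id query parameter removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session id, preserving order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def find_capture(requested_url, captured_urls):
    """Hypothetical lookup: try the exact URL, then the sid-less form."""
    if requested_url in captured_urls:
        return requested_url
    stripped = strip_sid(requested_url)
    return stripped if stripped in captured_urls else None


# Usage with an invented session id:
captured = {"https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10"}
clicked = "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10&sid=abc123"
print(find_capture(clicked, captured))
```

The stored responses stay untouched this way; only the player's URL matching gets more forgiving. Note that `urlencode` may re-escape unusual characters in the remaining parameters, so a real implementation would want to normalize both sides of the comparison the same way.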
My apologies, and thank you for the help. Closed.
--
grab-site 1.8.0 under macOS High Sierra 10.13.6 with python 3.4. attempting to run:
$ grab-site --1 "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20"
stalls for a long time near the end of the grab. it eventually completes after a few minutes with an image failing to be captured. (this one)
specifying the user agent:
$ grab-site --1 --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://ataristeven.exxoshost.co.uk/" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=10" "https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513&start=20"
allows the crawl to complete in roughly 20 seconds with that image downloaded just fine.
my assumption is that the image issue probably has to do with the website and has nothing to do with grab-site.
the issue? try running that .warc (either one) through the mac version of webarchive player 1.0.9.311 (that's what i did). click on the first forum page (https://www.exxoshost.co.uk/forum/viewtopic.php?f=13&t=513), then the second (the page number is near the top of the page). look at the url bar. i see a session id ☹️