apurvmishra99 / archiver

Archive all webpages in a website which are not already archived by archive.org
10 stars · 1 fork

Capture doesn't actually capture a website. If it isn't already archived #3

Open Kreijstal opened 4 years ago

Kreijstal commented 4 years ago

If you look at https://github.com/apurvmishra99/archiver/blob/07207f6c1571b9975c258e1db299a15482b81a12/archiver/archive.py#L71 you'll see that this line only probes archive.org to check whether the site is already archived; it doesn't actually tell archive.org to archive it. You need to send a POST request like this:

curl "https://web.archive.org/save/https://example.com/" \
  -H "authority: web.archive.org" \
  -H "origin: https://web.archive.org" \
  -H "content-type: application/x-www-form-urlencoded" \
  -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" \
  -H "sec-fetch-site: same-origin" \
  -H "sec-fetch-mode: navigate" \
  -H "sec-fetch-user: ?1" \
  -H "sec-fetch-dest: document" \
  -H "referer: https://web.archive.org/save/" \
  --data-raw "url=https%3A%2F%2Fexample.com%2F&capture_outlinks=on&capture_all=on"
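Since the repo is Python and already uses requests, the curl command above could be translated roughly as follows. This is a sketch, not the repo's actual code: the function name `build_save_request` is made up here, and the header list is trimmed to the parts that plausibly matter; the save endpoint's exact requirements may change on archive.org's side.

```python
import requests


def build_save_request(url):
    # Mirror the curl command: POST to /save/<url> with the form fields
    # url, capture_outlinks, and capture_all. requests form-encodes the
    # data dict, so the %3A/%2F escaping from --data-raw happens for us.
    req = requests.Request(
        "POST",
        "https://web.archive.org/save/" + url,
        headers={
            "Referer": "https://web.archive.org/save/",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        data={"url": url, "capture_outlinks": "on", "capture_all": "on"},
    )
    return req.prepare()


prepared = build_save_request("https://example.com/")
print(prepared.url)
print(prepared.body)
```

Sending it would then be `requests.Session().send(prepared)`; building the `PreparedRequest` separately just makes it easy to inspect the encoded body before anything hits the network.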
apurvmishra99 commented 4 years ago

Thanks for bringing this up! It looks like the Wayback Machine has changed the way they snapshot pages again. It should be an easy fix though; my only concern is that there is no id for the archive anymore, it's the timestamp + the URL. And that is in the cookies, so it's slightly more difficult to access from requests, but I will try to fix this.