ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Best way to grab this page? #155

Open sardaukar opened 4 years ago

sardaukar commented 4 years ago

I am trying to grab https://www.versionmuseum.com/history-of/classic-mac-os and having difficulty understanding grab-site. If I just do a normal grab with the URL, it starts crawling half the internet. If I do --no-offsite-links it still grabs everything on the website. With --1 it's quick, but the images on the page only work if clicked on, and are broken otherwise. With --level=1 I get a 206MB WARC file.

What's the best way to capture this page, and only this page (none of the Amazon or Word history links on the sidebar) and still have working thumbnails?

cherry-vanilla commented 4 years ago

Hey sardaukar, I run versionmuseum.com. What are you trying to do?

sardaukar commented 4 years ago

Trying to save that page about MacOS history locally. I get scared about interesting websites going down and wanted to preserve this one in my WARC collection. Isn't that what this project is for?

cherry-vanilla commented 4 years ago

Gotcha. Have no fear, we just re-launched the site a couple months ago -- it's gonna be around for a long time. We put the Mac OS page up only like a month ago. Hope you enjoyed the trip down memory lane!

sardaukar commented 4 years ago

Yeah, but my question was about how to use grab-site to capture the page. Maybe this website won't go away anytime soon, but others might, and they have pictures too. I just wanted to know how best to use the tool.

sardaukar commented 4 years ago

So there's no easy way to use grab-site to archive this page, then? Just so I can close the issue.

TheTechRobo commented 3 years ago

Did you end up resolving this issue? If not, large WARC files are typical, even on seemingly small websites. I've had to stop many a crawl because my internet is too slow to upload more than 150 MB to the Internet Archive and I don't have good permanent storage.

You can try manually blocking the pages in question: [image]
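
One way to sanity-check a blocking pattern before adding it to a crawl is to run it against a few URLs first. The sketch below is only an illustration: it assumes ignore patterns are regular expressions searched against the full URL (check the grab-site README for how it actually applies them). The pattern itself and every sample URL except the classic-mac-os one from this thread are made up for the example.

```python
import re

# Hypothetical ignore pattern: skip every /history-of/ page on the site
# except the one page we actually want to keep.
IGNORE_PATTERNS = [
    r"^https?://www\.versionmuseum\.com/history-of/(?!classic-mac-os(?:[/?#]|$))",
]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """Return True if any ignore pattern matches somewhere in the URL."""
    return any(re.search(p, url) for p in patterns)


# Only the first URL appears in this thread; the rest are invented samples.
samples = [
    "https://www.versionmuseum.com/history-of/classic-mac-os",
    "https://www.versionmuseum.com/history-of/amazon-website",
    "https://www.versionmuseum.com/history-of/microsoft-word",
    "https://www.versionmuseum.com/images/example-thumbnail.png",
]

for url in samples:
    print("skip" if is_ignored(url) else "keep", url)
```

The negative lookahead keeps the target page while matching its sibling pages, and URLs outside /history-of/ (such as image paths) are left alone so thumbnails can still be fetched.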