Question: How far does archivebox traverse?

ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

https://archivebox.io

MIT License

22.48k stars 1.19k forks source link

Question: How far does archivebox traverse? #333

Closed vext01 closed 4 years ago

vext01 commented 4 years ago

Hi,

I've recently discovered archivebox -- what a neat tool!

To try it out, I ran it on my personal website and was surprised to find that it followed links outside of my website too!

So my question is: How many links does it follow before stopping? Can this be controlled in any way?

Thanks!

P.S. I'm an OpenBSD developer. If you can get this up on PyPI, I'll happily make a port so that archivebox can be in the package manager.

pirate commented 4 years ago

If you pipe a link in via stdin it archives just that link, if you pass a URL as an arg it interprets it as a source to import other links from.

See:

https://github.com/pirate/ArchiveBox/wiki/Usage#import-a-single-url-or-list-of-urls-via-stdin

https://github.com/pirate/ArchiveBox/wiki/Usage#import-list-of-links-exported-from-browser-or-another-service

This difference in behavior is intentional but not intuitive, so it's been changed in the upcoming v0.4 archivebox add CLI design.

Thanks for the offer re: OpenBSD! If you want to subscribe to PR #207 you'll get an update when v0.4 ships on PyPI.

vext01 commented 4 years ago

If you pass a URL as an arg it interprets it as a source to import other links from.

I see. That indeed isn't intuitive. The new CLI makes much more sense. Looking forward to that!

So with the current design, if I pass a URL as an arg, it follows links 1 deep. Is that correct?

If you want to subscribe to PR #207 you'll get an update when v0.4 ships on PyPI.

Many thanks. Subscribed.

pirate commented 4 years ago

In a sense it follows one link deep, but that's not really what you want if you're looking for recursive archiving since it doesn't archive the original URL. What it's really doing is treating the path/link argument as a feed to import a list of links from, e.g. a browser history or pinboard export.