N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0

Feature request: following links #42

Closed: ghost closed this issue 5 years ago

ghost commented 5 years ago

Are you submitting a bug report or a feature request?

Feature request

What is the current behavior?

Squidwarc cannot be configured to keep following links whose URLs are in the same domain as the seeds.

What is the expected behavior?

Squidwarc can be configured to keep following links whose URLs are in the same domain as the seeds.

machawk1 commented 5 years ago

What I interpret this as (and @Sian1468, please correct me if wrong) is an option to only follow links if the links are in the same domain.

Normally I would think this is covered by domain restriction but I think what is being suggested is to ignore the "hops" restriction and keep following any links that are in the same domain. This is somewhat similar to "archive the whole site" provided the entire site is inter-linked from the starting seed.
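
To make that rule concrete, here is a minimal sketch (in TypeScript, not Squidwarc's actual code) of the frontier check being described: a discovered link is queued whenever its hostname matches the seed's hostname and it has not been seen yet, with no hop limit applied. The helper names and the hostname-only comparison are assumptions for illustration.

```ts
// Sketch of the "same domain, no hop limit" rule discussed above.
// sameDomain/shouldQueue are hypothetical helpers, not Squidwarc APIs.
function sameDomain(seedUrl: string, linkUrl: string): boolean {
  try {
    // Compare hostnames only; subdomain handling is a design choice left open here.
    return new URL(linkUrl).hostname === new URL(seedUrl).hostname;
  } catch {
    return false; // skip malformed links
  }
}

function shouldQueue(seedUrl: string, linkUrl: string, seen: Set<string>): boolean {
  // No depth/hops check: any unseen same-domain link keeps the crawl going.
  return sameDomain(seedUrl, linkUrl) && !seen.has(linkUrl);
}
```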

ghost commented 5 years ago

You are correct, @machawk1.

I think Squidwarc could do more than depth-limited capture: it could capture a whole site, with each page's offsite links included (for a single hop or per the depth setting) or with offsite links excluded entirely.
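
A rough sketch (hypothetical, not from Squidwarc) of that variant: same-domain links are always followed, while offsite links are optionally allowed for a single hop when linked directly from an in-domain page. The function name and parameters are illustrative only.

```ts
// Hypothetical: whole-site capture, with offsite links limited to one hop.
function shouldQueueWithOffsiteHop(
  seedUrl: string,
  parentUrl: string,
  linkUrl: string,
  includeOffsite: boolean
): boolean {
  const inDomain = (u: string) => new URL(u).hostname === new URL(seedUrl).hostname;
  if (inDomain(linkUrl)) return true;            // always follow same-domain links
  return includeOffsite && inDomain(parentUrl);  // offsite allowed one hop off an in-domain page
}
```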

N0taN3rd commented 5 years ago

@Sian1468, thank you for suggesting this; I believe it would be an excellent feature for Squidwarc.

I will be putting some thought into how to accomplish this nicely alongside the existing crawl modes. Do you have any suggestions as to how you would like to be able to specify this crawl mode?

ghost commented 5 years ago

Recursive crawl mode

I got the inspiration from other archiving tools and software, e.g. Wpull, grab-site, and crocoite.
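
As a hedged illustration of how such a recursive mode could be specified, the sketch below assumes a config object with seeds and mode-style keys; the 'site' mode value is invented here for illustration and is not necessarily the option Squidwarc ended up exposing.

```ts
// Hypothetical config sketch only; key names and the 'site' mode value are assumptions.
const crawlConfig = {
  seeds: ['https://example.com'], // starting URLs; their domains bound the crawl
  mode: 'site',                   // hypothetical: keep following every same-domain link
  // no depth key: the point of this mode is that the hop limit does not apply
};
```

In practice the only new knob a user needs is whether offsite links are included at all, so a single mode value (or a boolean flag) alongside the existing page modes would seem to be the lightest-weight way to express it.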

N0taN3rd commented 5 years ago

Implemented and merged into master in PR #47.