elastic / crawler


Add option to not crawl URLs already crawled in an index #105

Open jtele2 opened 2 months ago

jtele2 commented 2 months ago

Problem Description

I think it would be valuable to have an option to avoid duplicate crawls across runs, e.g., check an index to see whether a given URL has already been crawled and, if so, not crawl it again.

Proposed Solution

Something to the effect of:

bin/crawler crawl config/cisa_cybersecurity_advisories.yaml \
    --es-config config/elasticsearch.yaml \
    --no-duplicates-index <index_name>

The command above would then check the <index_name> index for URLs that have already been crawled and skip them.
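
For illustration, here's a rough sketch of the kind of lookup such a flag might perform, using Elasticsearch's _count API directly (requires curl and jq). The host, index name, and the assumption that crawled documents store the page address in a url field (or url.keyword, depending on the mapping) are placeholders, not actual crawler behavior:

    # Hypothetical pre-crawl check: has this URL already been indexed?
    ES_HOST="http://localhost:9200"
    INDEX="my-crawled-index"
    URL="https://example.com/articles/00001"

    count=$(curl -s -X POST "$ES_HOST/$INDEX/_count" \
      -H 'Content-Type: application/json' \
      -d "{\"query\": {\"term\": {\"url\": \"$URL\"}}}" | jq -r '.count')

    if [ "$count" -gt 0 ]; then
      echo "skip: $URL already indexed in $INDEX"
    else
      echo "crawl: $URL not found in $INDEX"
    fi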

Alternatives

Maybe I could get a list of all the URLs that have been crawled and pass them into the crawl rules (specified in CRAWL_RULES.md) as disallowed crawls.
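
A rough sketch of that alternative, again assuming the indexed documents expose their address in a url field and that the index is small enough for a single search request (host, index name, and field name are placeholders):

    # List URLs already present in the index; each one (or a shared path
    # prefix) could then be added as a deny entry in the crawl rules,
    # following the format described in CRAWL_RULES.md.
    ES_HOST="http://localhost:9200"
    INDEX="my-crawled-index"

    curl -s -X POST "$ES_HOST/$INDEX/_search" \
      -H 'Content-Type: application/json' \
      -d '{"_source": ["url"], "size": 10000, "query": {"match_all": {}}}' \
      | jq -r '.hits.hits[]._source.url'

The downside is that the crawl config would have to be regenerated before every run, and a very long deny list could get unwieldy.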

Additional Context

N/A

seanstory commented 2 months ago

Hi @jtele2 , thanks for filing.

Is the expectation in your use case that any page, once crawled, will never change? The current behavior is intended to cover scenarios where websites have changed between one crawl and the next, whether through new pages being added, existing pages being modified, or both. Without fetching the contents of a given page, the crawler cannot be certain that the page has not changed since the last crawl.

jtele2 commented 2 months ago

@seanstory yes sir - understood that many websites change, but take, for example, specific news articles, blogs, etc., which don't change much after they're published.

And if they do change, we accept the risk (usually they're small modifications or updates).

Example: example.com published article 00001 at https://example.com/articles/00001

We still want to crawl https://example.com/articles, but we don't want to waste time pulling and indexing article 00001 a second time.