Open · jtele2 opened 2 months ago
Hi @jtele2, thanks for filing.
Is the expectation in your use case that any page, once crawled, will never change? The current behavior is intended to cover scenarios where websites have changed between one crawl and the next - either through new pages being added, existing pages being modified, or both. Without fetching the contents of a given page, the crawler cannot be certain that the page has not changed since the last crawl.
@seanstory yes sir - understood that many websites change, but take, for example, specific news articles, blogs, etc., which don't change much after being published.
And if they do change, we accept the risk (usually they're small modifications or updates).
Example: example.com published article 00001 at https://example.com/articles/00001. We still want to crawl https://example.com/articles, but I don't want to waste time pulling and indexing article 00001 a second time.
Problem Description
I think it would be valuable to have an option to avoid duplicate crawls across runs. E.g., check an index to see if a given URL has already been crawled - if so, don't crawl it again.
Proposed Solution
Something to the effect of:
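A minimal sketch of what such an option might look like in the crawler's YAML config. The deduplication block and its field names are hypothetical (they are not part of the current config schema), and the surrounding keys are only assumed to mirror the existing config format:

```yaml
# Sketch only: the `deduplication` section is a hypothetical option,
# not something the crawler currently supports.
output_sink: elasticsearch
output_index: <index_name>

deduplication:
  enabled: true               # skip URLs already present in the index below
  lookup_index: <index_name>  # index to consult for previously crawled URLs
  url_field: url              # document field holding the canonical URL

domains:
  - url: https://example.com
    seed_urls:
      - https://example.com/articles
```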
Now the above should check the <index_name> index for URLs that have been crawled already.
Alternatives
Maybe I could get a list of all the URLs that have been crawled and pass them into the crawl rules (specified in CRAWL_RULES.md) as disallowed crawls, e.g. something like the sketch below.
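For illustration, a rough sketch of what such generated deny rules could look like, assuming the policy/type/pattern rule format from CRAWL_RULES.md (exact field names and defaults should be checked against that doc; the article path is just the example URL from above):

```yaml
domains:
  - url: https://example.com
    seed_urls:
      - https://example.com/articles
    crawl_rules:
      # One deny entry per URL already present in the index,
      # generated before each crawl from a query against <index_name>.
      - policy: deny
        type: begins
        pattern: /articles/00001
      # ...one entry per previously crawled article...
      # Leave everything else crawlable.
      - policy: allow
        type: regex
        pattern: .*
```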
Additional Context
N/A