Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

Are incremental web crawls supported? #11

Closed: davidfordaus closed this issue 4 years ago

davidfordaus commented 6 years ago

Hi there - does Norconex support incremental web crawls that would:

- Download headers first using the PROPFIND HTTP call to find the last modified date
- Lookup the date / time of the previous crawl of the page; IF the PROPFIND change date is more recent THEN download the page, update the database as normal
- At the end of a crawl delete (from the index / committer) any pages now not found

essiembre commented 6 years ago

The short answer is yes, your objective is covered, with minor differences.

Web crawling is done with the HTTP Collector (which uses Collector Core as a library).

> Download headers first using the PROPFIND HTTP call to find the last modified date

By default, it does not make a separate call just to obtain the date, to avoid an extra HTTP request per document. You can enable this by configuring a <metadataFetcher>. You can create your own implementation or use the existing GenericMetadataFetcher, which relies on an HTTP HEAD request to find the last modified date (same effect as PROPFIND). A minimal sketch is shown below.
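For illustration, a crawler section enabling the metadata fetcher could look like this (a sketch only; the class name assumes HTTP Collector 2.x, so check the documentation for your version):

```xml
<!-- Sketch: performs a separate HTTP HEAD call to fetch headers
     (including Last-Modified) before deciding whether to download.
     Class name assumes Norconex HTTP Collector 2.x. -->
<crawler id="my-crawler">
  <metadataFetcher
      class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
</crawler>
```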

> Lookup the date / time of the previous crawl of the page; IF the PROPFIND change date is more recent THEN download the page, update the database as normal

Done by default. This is the job of the <metadataChecksummer>, whose default implementation is LastModifiedMetadataChecksummer.

If you do not configure a metadata fetcher, it will still look at the HTTP headers for the last modified date, but it does so when downloading the document (HTTP GET) instead. See the sketch below.
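To make the default explicit, or to replace it with your own, the checksummer can be declared in the crawler configuration. Another hedged sketch, again assuming 2.x class names:

```xml
<!-- Sketch: explicitly declares the default checksummer, which
     compares the Last-Modified header against the previous crawl
     and skips unmodified documents. Assumes HTTP Collector 2.x. -->
<metadataChecksummer
    class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer" />
```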

> At the end of a crawl delete (from the index / committer) any pages now not found

Done by default. Documents that are no longer found trigger a deletion request to your Committer. You can also configure what to do with documents that are no longer "linked" from anywhere in your crawl, i.e. orphans (delete them, re-crawl them, ignore them); see the sketch below.
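Orphan handling is controlled at the crawler level. A minimal sketch, assuming the collector-core <orphansStrategy> option (valid values are PROCESS, IGNORE, and DELETE):

```xml
<!-- Sketch: sends deletion requests for documents that were crawled
     previously but are no longer reachable from any start URL. -->
<orphansStrategy>DELETE</orphansStrategy>
```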

Have a look at the configuration documentation and the URL crawling flow diagram, which may answer more of your questions.

danizen commented 6 years ago

@davidfordaus, I do this in my implementation and can verify that it works for me. For a smaller crawl, it may be best to recrawl into a new index and then discard the old one. This is natural with Elasticsearch, but not quite so natural with Solr, although it still works.