Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

Are incremental web crawls supported? #11

Closed: davidfordaus closed this issue 4 years ago

davidfordaus commented 6 years ago

Hi there - does Norconex support incremental web crawls that would:

- Download headers first using the PROPFIND HTTP call to find the last modified date
- Lookup the date / time of the previous crawl of the page; IF the PROPFIND change date is more recent THEN download the page, update the database as normal
- At the end of a crawl delete (from the index / committer) any pages now not found

essiembre commented 6 years ago

The short answer is yes, your objective is covered, with minor differences.

Web crawling is done with the HTTP Collector (which uses Collector Core as a library).

> Download headers first using the PROPFIND HTTP call to find the last modified date

By default, it does not make a separate call just to obtain the date, to avoid an extra HTTP request per document. You can enable this by configuring a <metadataFetcher>. You can create your own implementation or use the existing GenericMetadataFetcher, which relies on an HTTP HEAD request to find the last modified date (same effect as PROPFIND). A minimal sketch is shown below.
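For illustration, a crawler section enabling the metadata fetcher could look like this (a sketch only; the class name assumes HTTP Collector 2.x, so check the documentation for your version):

```xml
<!-- Sketch: performs a separate HTTP HEAD call to fetch headers
     (including Last-Modified) before deciding whether to download.
     Class name assumes Norconex HTTP Collector 2.x. -->
<crawler id="my-crawler">
  <metadataFetcher
      class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
</crawler>
```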

> Lookup the date / time of the previous crawl of the page; IF the PROPFIND change date is more recent THEN download the page, update the database as normal

Done by default. This is the job of the <metadataChecksummer>, whose default implementation is LastModifiedMetadataChecksummer.

If you do not configure a metadata fetcher, it will still look at the HTTP headers for the last modified date, but it does so when downloading the document (HTTP GET) instead. See the sketch below.
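To make the default explicit, or to replace it with your own, the checksummer can be declared in the crawler configuration. Another hedged sketch, again assuming 2.x class names:

```xml
<!-- Sketch: explicitly declares the default checksummer, which
     compares the Last-Modified header against the previous crawl
     and skips unmodified documents. Assumes HTTP Collector 2.x. -->
<metadataChecksummer
    class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer" />
```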

> At the end of a crawl delete (from the index / committer) any pages now not found

Done by default. Documents that are no longer found trigger a deletion request to your Committer. You can also configure what to do with documents that are no longer "linked" from anywhere in your crawl, i.e. orphans (delete them, re-crawl them, ignore them); see the sketch below.
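Orphan handling is controlled at the crawler level. A minimal sketch, assuming the collector-core <orphansStrategy> option (valid values are PROCESS, IGNORE, and DELETE):

```xml
<!-- Sketch: sends deletion requests for documents that were crawled
     previously but are no longer reachable from any start URL. -->
<orphansStrategy>DELETE</orphansStrategy>
```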

Have a look at the configuration documentation and the URL crawling flow diagram, which may answer more of your questions.

danizen commented 6 years ago

@davidfordaus, I do this in my implementation and can verify that it works for me. For a smaller crawl, it may be best to recrawl into a new index and then discard the old one. This is natural with Elasticsearch, but not quite so natural with Solr, although it still works.