Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystem and store it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

recrawling best practices #329

Closed · aleha84 closed 7 years ago

aleha84 commented 7 years ago

I'm using the crawler with the Elasticsearch Committer and want to know how to configure recrawling correctly. The crawler and Elasticsearch run on Windows: Elasticsearch runs as a Windows service, and the crawler should be run from a scheduler once per day.
Maybe I don't understand some details, but does the crawler store all info about downloaded pages in the collector's workdir? And on every next run, will it compare the current data with the previously downloaded content? Or must some specific settings be made in config.xml?

essiembre commented 7 years ago

You are correct. The collector will know what was previously crawled and will check for additions/modifications/deletions; by default, it will not send unmodified files. If you ever want to recrawl from scratch, you can simply delete the working directory (or more precisely, the "crawlstore" directory) before running the collector again.

Scheduling is done externally with the method of your choice. Usually, it is best handled by the OS scheduler. In your case, that would be the Windows Task Scheduler.
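
For example (the task name and paths below are placeholders, not from your setup), a daily 2 AM crawl could be registered from the command line, assuming the collector's standard collector-http.bat launcher:

schtasks /Create /TN "NorconexDailyCrawl" /SC DAILY /ST 02:00 /TR "C:\norconex-collector-http\collector-http.bat -a start -c C:\norconex-collector-http\config.xml"

Adjust the install path and config location to your installation.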

aleha84 commented 7 years ago

Just tested full indexing of my company site with these settings from my local dev system:

<delay default="150" ignoreRobotsCrawlDelay="true"></delay>
<numThreads>4</numThreads>
<maxDepth>-1</maxDepth>
<maxDocuments>-1</maxDocuments>

100% completed (160744 processed/160744 total) Crawler executed in 14 hours 36 minutes 2 seconds.

In my case I see 2 correct scenarios.

  1. Full reindex once per week/month to detect if any pages were modified.
  2. "Sketchy" reindex from the root page with maxDepth = 3 to find new entries in site sections, run each day or more often.

How does the crawler detect that a document was modified? Do I need any special configuration for it? If a previously found document is deleted from the site and the crawler no longer finds it, I expect it to be deleted from the index too; is that the expected behavior? Do I need a separate config.xml file for each described scenario, or is it possible to implement this with an additional crawler section in a single config? If so, how do I run crawlers by id?

essiembre commented 7 years ago

You do not need to do anything special. The default behavior will handle modifications and deletions. The collector internally stores a checksum for each document; that is how it knows whether a document was modified on subsequent runs. If a document no longer exists, the collector will send the committer a deletion request (which will delete the doc from Elasticsearch). The default document checksum implementation is MD5DocumentChecksummer.
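
For reference, this default is equivalent to declaring the checksummer explicitly in the crawler section of config.xml; a minimal sketch, assuming the collector-core class name (verify it against your version's documentation):

<documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer" />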

aleha84 commented 7 years ago

Which fields are used by default to detect whether a document changed or not? Full indexing took 2.5 hours, which is fine. But the first recrawl with maxDepth = 2 produced very few (or no) log entries starting with REJECTED_UNMODIFIED. The second time there were many more REJECTED_UNMODIFIED entries, but still a lot of DOCUMENT_COMMITTED_ADD. I believe I'm doing something wrong.

essiembre commented 7 years ago

By default it compares the body content (creating/caching a checksum of it). If some pages are dynamically generated, with parts of their content modified on each request, they will be re-crawled. Look at MD5DocumentChecksummer for more details.
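
If you want to see which pages actually change between runs, a sketch of one way to do it, assuming the keep/targetField options described in the MD5DocumentChecksummer documentation (the field name here is arbitrary):

<documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
    keep="true" targetField="myChecksum" />

With keep set to true, the computed checksum is stored in the target field and travels with the document to the committer, so you can compare values across crawls directly in Elasticsearch.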

aleha84 commented 7 years ago

Is it possible to exclude some specific fields or tags? For example, in the head I have a meta tag whose value changes every time, an annoying but unavoidable thing. Another example is the current date/time in the page header. Because of this, the page is always considered different. Modifying the site for the crawler's sake is a bit wrong.

essiembre commented 7 years ago

By default it does not use the metadata fields, just the extracted content. Here are a few options to get around your issue:
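
One such option, which the reply below confirms works, is StripBetweenTransformer from the Importer module: it removes text between two markers before the content is parsed, so the dynamic parts never reach the checksum. A sketch, assuming the dynamic content can be bracketed by known markers; the marker values and exact XML layout are illustrative, so check the StripBetweenTransformer documentation for your version:

<importer>
  <preParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
        inclusive="true">
      <stripBetween>
        <start><![CDATA[<!-- strip-start -->]]></start>
        <end><![CDATA[<!-- strip-end -->]]></end>
      </stripBetween>
    </transformer>
  </preParseHandlers>
</importer>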

aleha84 commented 7 years ago

StripBetweenTransformer is a good option.

essiembre commented 7 years ago

Glad that works for you.