Closed aleha84 closed 7 years ago
You are correct. The collector will know what was previously crawled, will check for additions/modifications/deletions and, by default, will not send unmodified files. If you ever want to recrawl from scratch, you can simply delete the working directory (or more precisely, the "crawlstore" directory) before running the collector again.
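For example, a minimal Windows sketch of forcing a recrawl from scratch (the paths are hypothetical; the crawlstore folder sits under the crawler's working directory, so adjust them to your own layout):

REM Hypothetical install and workdir paths; wipe the crawl store, then relaunch.
rmdir /S /Q "C:\norconex\workdir\my-crawler\crawlstore"
C:\norconex\collector-http.bat -a start -c C:\norconex\config.xml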
Scheduling is done externally with the method of your choice. Usually, it is best handled by the OS scheduler. In your case, that would be the Windows Task Scheduler.
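As a sketch, a daily run could be registered from a command prompt like this (task name, start time, and paths are only examples):

schtasks /Create /SC DAILY /ST 02:00 /TN "NorconexCrawl" /TR "C:\norconex\collector-http.bat -a start -c C:\norconex\config.xml"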
Just tested a full indexing of my company site with these settings from my local dev system:
<delay default="150" ignoreRobotsCrawlDelay="true"></delay>
<numThreads>4</numThreads>
<maxDepth>-1</maxDepth>
<maxDocuments>-1</maxDocuments>
100% completed (160744 processed/160744 total) Crawler executed in 14 hours 36 minutes 2 seconds.
In my case I see two correct scenarios.
How does the crawler detect that a document was modified? Do I need any special configuration for that? If a previously found document is deleted from the site and the crawler no longer finds it, I expect it to be removed from the index too; is that the expected behavior? Do I need a separate config.xml file for each described scenario, or is it possible to implement this with an additional crawler section in a single config? If so, how do I run crawlers by id?
You do not need to do anything special. The default behavior will handle modifications and deletions. The collector internally stores a checksum for each document. That is how it knows whether a document was modified or not on subsequent runs. If a document no longer exists, it will send a deletion request to the committer (which will delete the doc from Elasticsearch). The default document checksum implementation used is MD5DocumentChecksummer.
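For reference, the default checksummer can also be declared explicitly in the crawler configuration; this sketch simply matches the default behavior, so it is normally not required:

<documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer" />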
Which fields are used by default to detect whether a document changed or not? The full indexing took 2.5 hours, which is fine. But the first recrawl with maxDepth = 2 produced very few (if any) log entries starting with REJECTED_UNMODIFIED. The second time there were many more REJECTED_UNMODIFIED entries, but still a lot of DOCUMENT_COMMITTED_ADD. I believe I'm doing something wrong.
By default it compares the body content (creating/caching a checksum of it). If some pages are dynamically generated with parts of their content being modified each time, they will be re-crawled. Look at MD5DocumentChecksummer for more details.
Is it possible to exclude specific fields or tags? For example, in the head I have a meta tag whose value changes on every request, an annoying thing I cannot remove. Another example is the current date/time in the page header. Because of this, the page is always considered different. Modifying the site just for the crawler feels a bit wrong.
By default it is not using the metadata fields, just the extracted content. Here are a few options to get around your issue (both are sketched below):
- Use the <sourceFields> tag from the MD5DocumentChecksummer if you have a field that can be used instead (like a meta field with a last-updated date).
- Strip the always-changing parts of the content before the checksum is computed; StripBetweenTransformer is a good option for that.
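As configuration sketches (the field name, CSS class, and placement below are assumptions; check the documentation of your collector and importer versions for the exact options):

Checksum computed from a metadata field instead of the body (the field name "last-modified" is hypothetical):

<documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer">
  <sourceFields>last-modified</sourceFields>
</documentChecksummer>

Stripping a dynamic region of the page before it is parsed and checksummed, for example a header block that contains the current date/time (the "page-header" div is made up; start/end values are matched against the page source, hence the XML escaping):

<importer>
  <preParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
      <stripBetween>
        <start>&lt;div class="page-header"&gt;</start>
        <end>&lt;/div&gt;</end>
      </stripBetween>
    </transformer>
  </preParseHandlers>
</importer>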
Glad that works for you.
I'm using the crawler with the Elasticsearch Committer and want to know how to configure recrawling correctly. The crawler and Elasticsearch run on a Windows system. Elasticsearch runs as a Windows service; the crawler should be run from a scheduler once per day.
Maybe I don't understand some details, but does the crawler store all info about downloaded pages in the collector's workdir? And on every next run, will it compare the current data with the previously downloaded content? Or do some specific settings have to be made in config.xml?