Rejected documents are not sent for deletion; only documents that no longer exist, or "orphan" ones, are. You might think your <orphansStrategy>DELETE</orphansStrategy> would do it, but orphans are URLs that are no longer being referenced. In your case, those URLs are still being referenced, simply rejected (i.e., they will not be sent to your search engine).
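For reference, here is a minimal sketch of a crawler configuration using that strategy (the surrounding element names follow the typical 2.x layout and are only illustrative; the start URL is a placeholder):

```xml
<httpcollector id="my-collector">
  <crawlers>
    <crawler id="my-crawler">
      <startURLs>
        <url>https://example.com/</url>
      </startURLs>
      <!-- DELETE only affects URLs that are no longer referenced (orphans).
           URLs that are still referenced but rejected (e.g., by robots.txt)
           are not deleted by this setting. -->
      <orphansStrategy>DELETE</orphansStrategy>
    </crawler>
  </crawlers>
</httpcollector>
```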
This is related to feature request #211. If you think it is the same request, I will mark this one as a duplicate.
Thanks for looking into this issue.
My understanding is that the HTTP collector will honor the robots.txt entries.
The crawler should always check documents against the updated robots.txt entries, irrespective of their modification state, and send a deletion request for those that are already indexed. It should also log the event REJECTED_ROBOTS_TXT instead of REJECTED_UNMODIFIED.
I have a couple of questions:
/ Alok
Right now, URLs rejected by robots.txt are simply not processed at all (they are ignored/skipped). This greatly improves performance for many crawls. Comparing every URL against URLs previously crawled would mean querying the crawl store for every URL encountered, just in case there was a change to the robots.txt. That is currently not offered out-of-the-box. I will mark this as a feature request to provide the option to do so.
To answer your questions:
- Is the robots.txt check performed before fetching the document for content modification?
Yes, before. A document will not be downloaded if rejected by robots.txt (default behavior).
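For completeness, that default behavior corresponds to robots.txt support being enabled, which in a 2.x crawler configuration is controlled by the <robotsTxt> element (a sketch only; attribute and class names may vary slightly between versions):

```xml
<!-- Inside the <crawler> section; this reflects the default behavior. -->
<robotsTxt ignore="false"
    class="com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider" />
```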
- Is the robots.txt match done against the original URL or the normalized URL?
I suggest you refer to the following flow diagram to get a better understanding of the order in which the various tasks are executed: https://www.norconex.com/collectors/collector-http/flow
Thanks for your response and for marking it as a feature request.
/ Alok
With version 3.0.0, it is now possible to send deletion requests to your Committer(s) by listening to rejection events with DeleteRejectedEventListener. Have a look at https://github.com/Norconex/collector-http/issues/211#issuecomment-927245122 for an example.
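Along the lines of that example, the listener goes under the crawler's event listeners and is told which rejection events should trigger deletion requests. A sketch (verify the exact elements against the linked comment; REJECTED_ROBOTS_TXT is the event discussed above):

```xml
<!-- Inside the <crawler> section of a v3 configuration. -->
<eventListeners>
  <listener class="DeleteRejectedEventListener">
    <!-- List the rejection events that should trigger a deletion request. -->
    <eventMatcher method="csv">REJECTED_ROBOTS_TXT</eventMatcher>
  </listener>
</eventListeners>
```

With this in place, URLs rejected by robots.txt on a subsequent run would have deletion requests sent to the configured Committer(s).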
@essiembre
Hi Pascal
Norconex version: 2.9.0-SNAPSHOT
We have encountered an issue where new additions to the robots.txt file are not honored by the Norconex crawler: documents matching the new disallow rules are not being removed from the index. I am using the Norconex 2.9.0-SNAPSHOT version. The initial run did honor the robots.txt file and rejected the documents that were disallowed in it.
Later, we added 2 more disallows to the robots.txt file -
The crawler logs indicated these 2 documents were not modified and ignored them.
Norconex configuration