Is there a way to "tell" Norconex Collector that an unavailable URL is not a Document deletion?
I can think of a few ways.
Do not keep crawl history:
With version 2.9.0, you can delete the crawl store before each run. That way you won't get any deletions as it will not have a crawl history to compare to between crawler runs.
With version 3, the equivalent would be to always launch with the -clean command-line flag.
Turn some HTTP error codes into "valid" codes:
If you are using the GenericDocumentFetcher (or GenericHttpFetcher with v3), you can specify which HTTP response codes should be treated as valid:
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectContentType="[false|true]" detectCharset="[false|true]">
  <validStatusCodes>200</validStatusCodes>
  <notFoundStatusCodes>404</notFoundStatusCodes>
</documentFetcher>
The above are the default codes used. Anything else can potentially be interpreted as an error. You can add more to the valid status codes list, comma-separated.
Thank you Pascal. About "That way you won't get any deletions as it will not have a crawl history to compare to between crawler runs": I don't want already imported documents to be re-imported on every run.
"you can specify which HTTP response code should be treated as valid codes"
This seems to be a good way to solve my problem. I will check it as soon as I can; I guess the following setting would treat 404 or 500 as valid codes:
<validStatusCodes>200,404,500</validStatusCodes>
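For reference, a sketch of how that could look when folded into the fetcher snippet quoted above (assuming the 2.x GenericDocumentFetcher; whether 404 should then also be removed from notFoundStatusCodes is not verified here):

<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
  <!-- 404 and 500 added to the default 200 so those responses are not treated as errors -->
  <validStatusCodes>200,404,500</validStatusCodes>
</documentFetcher>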
My strong desire is to satisfy both requirements. How can they be combined?
To "clean" a repo with 2.9.x, you have little choice but to delete the crawl store (or the entire "workdir" folder). That would address your first bullet. Using "IGNORE" or "GRACE_ONCE" for spoiledReferenceStrategizer
should also do it. The main differences:
Deleting the crawl store: after that, the crawler will consider every URL it encounters as new.
Ignoring/gracing bad URLs: with IGNORE, if I am not mistaken, the document is not deleted from your index but is removed from the cache. Same with GRACE_ONCE if the URL stops responding for more than one consecutive crawling session. That means that yes, the moment a URL is available again, it will be considered "new" and will be recrawled.
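As an illustration, a graceful setup along those lines might look roughly like this, following the documented format of GenericSpoiledReferenceStrategizer (a sketch, not a tested configuration; pick the strategy per state that fits your case):

<spoiledReferenceStrategizer class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
    fallbackStrategy="GRACE_ONCE">
  <!-- Do not delete documents whose URLs misbehave in a single session -->
  <mapping state="NOT_FOUND"  strategy="GRACE_ONCE" />
  <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
  <mapping state="ERROR"      strategy="GRACE_ONCE" />
</spoiledReferenceStrategizer>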
The second scenario should normally suffice. The crawler often has to download a file again anyway to find out whether it has changed. The difference is whether it resends it to the committer (IDOL in your case) or not. Given that only the non-responding URLs would be resent, and they keep the same URL as before, a few more documents sent to IDOL should not make much of an impact: IDOL should overwrite any existing entry, so it is not creating duplicates or anything like that.
What is the issue in your case of having a few valid documents be resent to IDOL when they are back online?
"What is the issue in your case of having a few valid documents be resent to IDOL when they are back online?"
Dear Pascal, our issue with "having a few valid documents be resent to IDOL" is that the number of resent documents could be 5,000 to 10,000, because we use the Norconex HTTP Collector against a document repository containing up to 1 million documents.
Is there any advantage in switching to version 3.0.x of your HTTP crawler? We not only need to IGNORE or GRACE_ONCE temporarily offline URLs, we also need to avoid re-importing thousands of documents (this is the case when a "portion/namespace" of our Hitachi HCP document repo goes offline for a while).
Any answer?
Yesterday I tried to use the orphans strategy:
<orphansStrategy>IGNORE</orphansStrategy>
I hoped that would solve my problem, since "Orphans are valid documents, which on subsequent crawls can no longer be reached when running the crawler".
But when documents that were orphans come back online, the crawler imports them as NEW. Shouldn't it "remember" that they have been imported before?
Glad you found a way. The crawl cache only keeps traces of documents from their last session so that they can be compared with the next session. It does not keep a crawl "history" for each document. That is why, if a document falls off the radar, it will be considered new the next time it is encountered.
To keep track of such history, a few ideas:
Implement your own ICrawlDataStoreFactory (or modify an existing one) and somehow always keep old records (the equivalent interface under version 3.x is IDataStoreEngine).
Implement your own IDocumentChecksummer to rely on custom storage of older documents for checksum comparison.
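If you go the custom checksummer route, it would be plugged into the crawler configuration roughly as below; the class name is a placeholder for your own (hypothetical) IDocumentChecksummer implementation, not an existing Norconex class:

<!-- Hypothetical checksummer that consults your own store of previously imported documents -->
<documentChecksummer class="com.example.HistoryAwareDocumentChecksummer" />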
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I have this problem with my Collector HTTP 2.9.0 installation:
a) the collector crawls with a 1-day delay
b) keepDownloads is false, to save disk space
c) the collector only crawls URLs listed in a text file (1 URL per line; they point to documents such as PDF and DOCX)
The problem:
d) when a URL is temporarily unavailable, Norconex marks it for deletion
e) even when setting spoiledReferenceStrategizer to ignore BAD_STATUS, NOT_FOUND and ERROR, on the next crawling cycle Norconex considers the document as a NEW one, re-importing it and sending it to the committer (IdolCommitter for CFS)
Is there a way to "tell" Norconex Collector that an unavailable URL is not a document deletion? Thank you.