Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Documents marked for Deletion if their URL is not responding #792

Closed ciroppina closed 2 years ago

ciroppina commented 2 years ago

Hi, I have this problem with my Collector HTTP 2.9.0 installation:

a) the collector crawls with a 1-day delay
b) keepDownloads is false, to save disk space
c) the collector only crawls URLs listed in a text file (1 URL per line; they are documents such as PDF and DOCX)

The problem:

d) when a URL is temporarily unavailable, Norconex marks it for deletion
e) even with "spoiledReferenceStrategizer" set to ignore BAD_STATUS, NOT_FOUND, and ERROR, on the next crawling cycle Norconex considers the document a NEW one, re-importing it and sending it to the committer (IdolCommitter for CFS)
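For context, a simplified sketch of this setup (element names taken from Collector HTTP 2.x; the file path is just a placeholder):

  <crawler id="docs-crawler">
    <startURLs>
      <!-- one document URL per line (PDF, DOCX, ...) -->
      <urlsFile>/path/to/url-list.txt</urlsFile>
    </startURLs>
    <!-- do not keep downloaded files, to save disk space -->
    <keepDownloads>false</keepDownloads>
  </crawler>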

Is there a way to "tell" the Norconex Collector that an unavailable URL should not be treated as a document deletion? Thank you.

essiembre commented 2 years ago

Is there a way to "tell" the Norconex Collector that an unavailable URL should not be treated as a document deletion?

I can think of a few ways.

Do not keep crawl history:

With version 2.9.0, you can delete the crawl store before each run. That way you won't get any deletions as it will not have a crawl history to compare to between crawler runs.

With version 3, the equivalent would be to always launch with the -clean command-line flag.

Turn some HTTP error codes into "valid" codes:

If you are using the GenericDocumentFetcher (or GenericHttpFetcher with v3), you can specify which HTTP response codes should be treated as valid.

  <documentFetcher
      class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
      detectContentType="[false|true]" detectCharset="[false|true]">
    <validStatusCodes>200</validStatusCodes>
    <notFoundStatusCodes>404</notFoundStatusCodes>
  </documentFetcher>

The above are the default codes used. Anything else can potentially be interpreted as an error. You can add more to the valid status codes list, comma-separated.

ciroppina commented 2 years ago

Thank you Pascal. Regarding "That way you won't get any deletions as it will not have a crawl history to compare to between crawler runs": I don't want already-imported documents to be re-imported on every run.

"you can specify which HTTP response code should be treated as valid codes" This seems to be the good way to solve my problem - I will check it as soon as I can; I guess the following setting, in order to consder 404 or 500 as valid codes: <validStatusCodes>200,404,500</validStatusCodes>

ciroppina commented 2 years ago

My strong desire is:

How can these two requirements be combined?

essiembre commented 2 years ago

To "clean" a repo with 2.9.x, you have little choice but to delete the crawl store (or the entire "workdir" folder). That would address your first bullet. Using "IGNORE" or "GRACE_ONCE" for spoiledReferenceStrategizer should also do it. The main differences:

Deleting the crawl store: The crawler will consider all URLs encountered as new after that.

Ignoring/gracing bad URLs: With IGNORE, if I am not mistaken, the document is not deleted from your index but will be removed from the cache. Same with GRACE_ONCE if the URL is not responding for more than one consecutive crawling session. That means that yes, the moment a URL is available again, it will be considered "new" and will be recrawled.

The second scenario should normally suffice. The crawler often has to download a file again anyway to find out whether it has changed; the difference is whether it resends it to the committer (IDOL in your case) or not. Given that only the non-responding URLs would be resent, and they keep the same URLs as before, a few more documents sent to IDOL should not make much of an impact: they should overwrite any existing entries, so it is not creating duplicates or anything like that.
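For reference, a spoiled-reference configuration along those lines might look like this (a sketch only; the GenericSpoiledReferenceStrategizer class and its mapping/state/strategy names are assumed from the 2.x collector-core configuration, not taken from this thread):

  <spoiledReferenceStrategizer
      class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
      fallbackStrategy="DELETE">
    <!-- keep temporarily failing documents instead of deleting them right away -->
    <mapping state="BAD_STATUS" strategy="GRACE_ONCE"/>
    <mapping state="ERROR" strategy="IGNORE"/>
    <!-- documents that are truly gone can still be deleted -->
    <mapping state="NOT_FOUND" strategy="DELETE"/>
  </spoiledReferenceStrategizer>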

What is the issue in your case of having a few valid documents be resent to IDOL when they are back online?

ciroppina commented 2 years ago

"What is the issue in your case of having a few valid documents be resent to IDOL when they are back online?"

Dear Pascal, our issue with "having a few valid documents be resent to IDOL" is that the number of resent documents could be 5,000 to 10,000, because we use the Norconex HTTP Collector against a document repository containing up to 1 million documents.

ciroppina commented 2 years ago

Is there any advantage to switching to version 3.0.x of your HTTP crawler? We not only need to IGNORE or GRACE_ONCE temporarily offline URLs, we also need to avoid re-importing thousands of documents (this is the case when a "portion/namespace" of our Hitachi HCP document repository goes offline for a time).

ciroppina commented 2 years ago

any answer?

ciroppina commented 2 years ago

Yesterday I tried to use the orphans strategy: <orphansStrategy>IGNORE</orphansStrategy>. I hoped that would solve my problem; indeed, "Orphans are valid documents, which on subsequent crawls can no longer be reached when running the crawler". But when documents that were orphans come back online, the crawler imports them as NEW. Shouldn't it "remember" that they have been imported before?

essiembre commented 2 years ago

Glad you found a way. The crawl cache only keeps traces of documents from their last session so that they can be compared with the next session. It does not keep a crawl "history" for each document. That is why, if a document falls off the radar, it will be considered new the next time it is encountered.

To keep track of such history, a few ideas:

  1. Look at implementing your own ICrawlDataStoreFactory (or modifying an existing one) and somehow always keep old records (the interface is IDataStoreEngine under version 3.x).
  2. Override (or modify an existing) IDocumentChecksummer to rely on custom storage of older documents for checksum comparison.
stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.