Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

SpoiledReferenceStrategizer not deleting documents after grace period #635

Closed dtcyad1 closed 5 years ago

dtcyad1 commented 5 years ago

I am currently having an issue with the SpoiledReferenceStrategizer behaviour. The default strategy for BAD_STATUS is GRACE_ONCE. My config file does not set this explicitly. Would that be an issue? I was assuming that we don't need to have the spoiled-reference setting explicitly in the config file.

However, to test, I did include this:

<spoiledReferenceStrategizer class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
    fallbackStrategy="DELETE">
  <mapping state="NOT_FOUND"  strategy="DELETE" />
  <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
  <mapping state="ERROR"      strategy="GRACE_ONCE" />
</spoiledReferenceStrategizer>

To give some background, Norconex is run once a day, i.e., it starts and then stops after the crawl.

Day 1 - It finds and adds the URL.

Later on, this URL is unpublished.

Day 2 - It finds the URL, now with a BAD_STATUS, and prints this message:

DEBUG (AbstractCrawler.java:692) - website: this spoiled reference is being graced once (will be deleted next time if still spoiled): https://test.com/test

Day 3 - Nothing happens. It detects the bad status again, but the document does not get deleted:

2019-08-29 14:42:42,092 [pool-1-thread-1] INFO  (CrawlerEventManager.java:67) -       REJECTED_BAD_STATUS: https://test.com/test (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=403, reasonPhrase=Forbidden])
2019-08-29 14:42:42,093 [pool-1-thread-1] DEBUG (Pipeline.java:93) - Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@1906ff3f
2019-08-29 14:42:42,097 [pool-1-thread-1] DEBUG (FileJobStatusStore.java:174) - Writing status file: workdir/progress/latest/status/website.job
2019-08-29 14:42:42,118 [pool-1-thread-1] DEBUG (FileJobStatusStore.java:174) - Writing status file: workdir/progress/latest/status/website.job
2019-08-29 14:42:42,124 [pool-1-thread-1] DEBUG (AbstractCrawler.java:423) - website: 00:00:01.124 to process: https://test.com/test
2019-08-29 14:42:42,125 [pool-1-thread-1] INFO  (AbstractCrawler.java:400) - website: Maximum documents reached: 1
2019-08-29 14:42:42,126 [website] INFO  (AbstractCrawler.java:351) - website: Deleting orphan references (if any)...
2019-08-29 14:42:42,127 [website] INFO  (AbstractCrawler.java:364) - website: Deleted 0 orphan references...
2019-08-29 14:42:42,127 [website] INFO  (AbstractCrawler.java:271) - website: Crawler finishing: committing documents.
2019-08-29 14:42:42,128 [website] INFO  (AbstractCrawler.java:277) - website: 1 reference(s) processed.
2019-08-29 14:42:42,128 [website] DEBUG (AbstractCrawler.java:279) - website: Removing empty directories
2019-08-29 14:42:42,129 [website] INFO  (CrawlerEventManager.java:67) -          CRAWLER_FINISHED
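
For reference, the behaviour expected from GRACE_ONCE across the three daily runs above can be sketched as follows. This is a standalone simulation written for illustration only, not Norconex code: the class `GraceOnceSketch`, its `process` method, and the `Action` enum are hypothetical names, and the real crawler tracks state in its own reference store rather than in a map.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone simulation of the expected GRACE_ONCE semantics.
// Not Norconex code: every name in this class is hypothetical.
class GraceOnceSketch {

    enum Strategy { DELETE, GRACE_ONCE, IGNORE }
    enum Action { KEEP, DELETE }

    // Remembers, per reference, whether the previous crawl found it spoiled.
    private final Map<String, Boolean> spoiledLastRun = new HashMap<>();

    // Called once per crawl session for each reference.
    Action process(String ref, boolean spoiledNow, Strategy strategy) {
        boolean wasSpoiled = spoiledLastRun.getOrDefault(ref, false);

        Action action;
        if (!spoiledNow) {
            action = Action.KEEP;                 // healthy reference
        } else if (strategy == Strategy.DELETE) {
            action = Action.DELETE;               // delete right away
        } else if (strategy == Strategy.IGNORE) {
            action = Action.KEEP;                 // never delete
        } else {
            // GRACE_ONCE: tolerate the first spoiled crawl, delete on the second.
            action = wasSpoiled ? Action.DELETE : Action.KEEP;
        }

        if (action == Action.DELETE) {
            spoiledLastRun.remove(ref);           // forget deleted references
        } else {
            spoiledLastRun.put(ref, spoiledNow);  // remember state for the next run
        }
        return action;
    }

    public static void main(String[] args) {
        GraceOnceSketch crawler = new GraceOnceSketch();
        String url = "https://test.com/test";
        // Day 1: URL is fine. Days 2 and 3: URL returns a bad status.
        System.out.println("Day 1: " + crawler.process(url, false, Strategy.GRACE_ONCE)); // KEEP
        System.out.println("Day 2: " + crawler.process(url, true, Strategy.GRACE_ONCE));  // KEEP
        System.out.println("Day 3: " + crawler.process(url, true, Strategy.GRACE_ONCE));  // DELETE
    }
}
```

In this sketch the Day 3 run returns DELETE, which is the step that is missing in the log above.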

Can you please provide some info on how to correctly ensure that GRACE_ONCE is honored and the delete is issued after that?

Thanks

essiembre commented 5 years ago

If I understand you right, you want a formerly "good" URL now returning a BAD_STATUS to be deleted right away? If so, you should be able to use the XML config you tried, but changing the BAD_STATUS line to DELETE:

<mapping state="BAD_STATUS" strategy="DELETE" />

dtcyad1 commented 5 years ago

Hi Pascal,

No, I want it to be deleted after it has been graced once. But that is not happening, as you can see from the logs.

Thanks

essiembre commented 5 years ago

I was able to reproduce the issue. A new snapshot release has been made with the fix. Please try it and confirm. If you install over an existing installation, make sure you do not end up with different versions of the same jars in your lib folder (a fresh install is recommended).

dtcyad1 commented 5 years ago

Hi Pascal,

The fix works great!! Thanks for a fast response - really appreciate that!!