Closed dtcyad1 closed 5 years ago
If I understand you right, you want a formerly "good" URL that now returns a BAD_STATUS to be deleted right away? If so, you should be able to use the XML config you tried, but with the BAD_STATUS line changed to DELETE:
<mapping state="BAD_STATUS" strategy="DELETE" />
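For context, here is a minimal sketch of where that mapping line would typically sit in a crawler configuration. The enclosing element and class name are assumptions based on the Norconex Collector Core documentation for GenericSpoiledReferenceStrategizer, and the NOT_FOUND and ERROR mappings are illustrative values rather than the config from this thread, so check them against your installed version.

```xml
<!-- Sketch only: class name and surrounding mappings are assumptions,
     not taken from the original poster's configuration. -->
<spoiledReferenceStrategizer
    class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer">
  <!-- Illustrative: delete references that are no longer found. -->
  <mapping state="NOT_FOUND" strategy="DELETE" />
  <!-- The line suggested above: delete immediately on a bad status
       instead of gracing the reference once. -->
  <mapping state="BAD_STATUS" strategy="DELETE" />
  <!-- Illustrative: tolerate a single error before deleting. -->
  <mapping state="ERROR" strategy="GRACE_ONCE" />
</spoiledReferenceStrategizer>
```

Setting BAD_STATUS back to GRACE_ONCE in the same block would request the "grace once, then delete" behaviour instead.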
Hi Pascal,
No, I want it to be deleted after the GRACE_ONCE grace has been applied. But that is not happening, as you can see from the logs.
Thanks
On Aug 31, 2019, at 7:11 PM, Pascal Essiembre notifications@github.com wrote:
If I understand you right, you want a formerly "good" URL now returning a BAD_STATUS to be deleted right away? If so, you should be able to use the XML config you tried, but changing the BAD_STATUS line to DELETE:
I was able to reproduce the issue. A new snapshot release has been made with the fix. Please try it and confirm. If you install over an existing installation, make sure you do not end up with different versions of the same jars in your lib folder (a fresh install is recommended).
Hi Pascal,
The fix works great!! Thanks for the fast response - I really appreciate it!!
I am currently having an issue with the SpoiledReferenceStrategizer behaviour. The default strategy for BAD_STATUS is GRACE_ONCE. My config file does not set this explicitly. Would that be an issue? I was assuming that the spoiled-reference settings do not need to be declared explicitly in the config file.
However, to test, I did include this:
To give some background, Norconex is run once a day, i.e., it starts and then stops after the crawl.
Day 1 - It finds and adds the URL.
Later on, this URL is unpublished.
Day 2 - It finds the URL, which now returns a BAD_STATUS, and prints this message:
DEBUG (AbstractCrawler.java:692) - website: this spoiled reference is being graced once (will be deleted next time if still spoiled): https://test.com/test
Day 3 - Nothing happens. It detects that the URL still has a bad status, but the URL does not get deleted.
Can you please provide some info on how to ensure that GRACE_ONCE is honored and that the delete is issued after the grace?
Thanks