Closed mariuspruski closed 5 years ago
The latest release has this fix in. Can you try it and confirm it works for you?
Thanks for addressing the issue; unfortunately I can't test it right now, as I'm off the project.
No problem. I will close this then. Thanks for your fix.
My issue: I'm using a DOMSplitter to create several documents out of one reference, and these child documents will always appear as modified because their cached crawl data is never loaded from the store.

My goal: Child documents should not be added if they weren't modified since the last crawler run.
Additional info: I have looked into the Norconex core and noticed that cachedCrawlData is loaded only once, for the parent reference (see AbstractCrawler.java::517). This cachedCrawlData is also passed on unchanged for all child documents. However, it contains no checksums for the child documents: there is no oldChecksum available for their references. That's why the checksum comparison always reports that the document has been modified and needs to be added.

Solution suggestion: I managed to fix this issue (and produce the desired result) by changing the code in
AbstractCrawler.java::609 to the following:

Meaning that for every processed child document, before processing its import response, I load the cachedCrawlData for this child's reference (under which it was saved in the last run).
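For readers who want to see the idea in isolation, here is a minimal, self-contained sketch of the behaviour described above. It is not the Norconex code and not the actual patch: CrawlData, CrawlStore, isModified, modifiedChildren, and the "doc.xml!0" style child references are all hypothetical stand-ins, assumed only for illustration. It compares reusing the parent's cached entry for every child with reloading the cached entry per child reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChildCacheSketch {

    /** Hypothetical crawl record: a reference plus the checksum of its content. */
    static final class CrawlData {
        final String reference;
        final String checksum;
        CrawlData(String reference, String checksum) {
            this.reference = reference;
            this.checksum = checksum;
        }
    }

    /** Hypothetical stand-in for the store holding the previous run's crawl data. */
    static final class CrawlStore {
        private final Map<String, CrawlData> byReference = new HashMap<>();
        void save(CrawlData data) { byReference.put(data.reference, data); }
        CrawlData getCached(String reference) { return byReference.get(reference); }
    }

    /** Modified if there is no old checksum for this reference, or if it differs. */
    static boolean isModified(CrawlData cached, CrawlData current) {
        String oldChecksum = (cached != null && cached.reference.equals(current.reference))
                ? cached.checksum
                : null;
        return oldChecksum == null || !oldChecksum.equals(current.checksum);
    }

    /**
     * Returns the child references considered modified.
     * reloadPerChild=false mimics the reported behaviour (parent cache reused);
     * reloadPerChild=true mimics the suggested fix (cache loaded per child reference).
     */
    static List<String> modifiedChildren(CrawlStore store, String parentRef,
                                         List<CrawlData> children, boolean reloadPerChild) {
        CrawlData cached = store.getCached(parentRef); // loaded once, for the parent
        List<String> modified = new ArrayList<>();
        for (CrawlData child : children) {
            if (reloadPerChild) {
                cached = store.getCached(child.reference); // the suggested change
            }
            if (isModified(cached, child)) {
                modified.add(child.reference);
            }
        }
        return modified;
    }

    public static void main(String[] args) {
        CrawlStore store = new CrawlStore(); // state left by the previous run
        store.save(new CrawlData("doc.xml", "p1"));
        store.save(new CrawlData("doc.xml!0", "a"));
        store.save(new CrawlData("doc.xml!1", "b"));

        List<CrawlData> children = Arrays.asList(
                new CrawlData("doc.xml!0", "a"),        // unchanged child
                new CrawlData("doc.xml!1", "CHANGED")); // changed child

        System.out.println(modifiedChildren(store, "doc.xml", children, false)); // [doc.xml!0, doc.xml!1]
        System.out.println(modifiedChildren(store, "doc.xml", children, true));  // [doc.xml!1]
    }
}
```

With the parent's cached entry reused, both children are reported as modified. With a per-child lookup, only the child whose checksum actually changed is reported.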