Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

No CachedCrawlData loaded for nested importer responses #23

Closed mariuspruski closed 5 years ago

mariuspruski commented 6 years ago

My issue: I'm using a DOMSplitter to create several documents out of one reference, and these child documents will always appear as modified because their cached crawl data is never loaded from the store.

My goal: Child documents should not be added if they weren't modified since the last crawler run.

Additional info: I have investigated the Norconex core and noticed that

Solution suggestion: I managed to fix this issue (and got the desired result) by changing the code at AbstractCrawler.java::609 to the following:

        for (ImporterResponse child : children) {
            BaseCrawlData embeddedCrawlData = createEmbeddedCrawlData(
                    child.getReference(), crawlData);
            // Added: load the cached crawl data stored under the child's reference.
            cachedCrawlData = (BaseCrawlData) crawlDataStore.getCached(
                    child.getReference());
            processImportResponse(
                    child, crawlDataStore, embeddedCrawlData, cachedCrawlData);
        }

In other words, before processing each child document's import response, I load the cached crawl data for that child's reference (the reference under which it was saved in the last run).
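To illustrate why the per-child lookup matters, here is a minimal, self-contained sketch. It does not use the Norconex API: `ChildCacheSketch`, `modifiedChildren`, the plain `Map` standing in for the crawl data store, and the checksum comparison are all hypothetical simplifications, and the assumption that the unpatched loop reuses the cached entry already in scope for the parent is mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the crawler loop; not Norconex code. */
final class ChildCacheSketch {
    private ChildCacheSketch() {}

    /**
     * Returns the references of children that look new or modified.
     *
     * @param previousRun     reference -> checksum saved by the last run
     *                        (stand-in for the crawl data store cache)
     * @param parentReference reference of the document that was split
     * @param children        child reference -> checksum from the current run
     * @param lookUpPerChild  true: look up each child's own cached entry;
     *                        false: reuse the parent's cached entry (assumed
     *                        to mirror the behavior described in this issue)
     */
    static List<String> modifiedChildren(
            Map<String, String> previousRun,
            String parentReference,
            Map<String, String> children,
            boolean lookUpPerChild) {
        List<String> modified = new ArrayList<>();
        for (Map.Entry<String, String> child : children.entrySet()) {
            // Choose which cached entry the child is compared against.
            String cacheKey = lookUpPerChild ? child.getKey() : parentReference;
            String cachedChecksum = previousRun.get(cacheKey);
            // A missing or different cached checksum means "modified".
            if (!child.getValue().equals(cachedChecksum)) {
                modified.add(child.getKey());
            }
        }
        return modified;
    }
}
```

With a previous run that stored a parent `page.html` and two children, and a current run in which only the second child changed, calling the method with `lookUpPerChild` set to `false` reports both children as modified, while `true` reports only the changed one.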

essiembre commented 5 years ago

The latest release has this fix in. Can you try it and confirm it works for you?

mariuspruski commented 5 years ago

Thanks for addressing the issue; unfortunately I can't test it right now, as I'm off the project.

essiembre commented 5 years ago

No problem. I will close then. Thanks for your fix.