jetnet closed this issue 5 years ago
Forgot to mention that the orphan strategy is set to DELETE:
<orphansStrategy>DELETE</orphansStrategy>
Can you share a config that reproduces the issue? I tried the split as you did, which works fine (with DELETE orphans strategy). Then I re-ran it a few times, and if the containing document stays the same (unmodified), the children also appear unmodified. Relevant snippet:
INFO [CrawlerEventManager] REJECTED_UNMODIFIED: http://localhost/crawl-tests/
INFO [CrawlerEventManager] REJECTED_UNMODIFIED: http://localhost/crawl-tests/!html > body > p:nth-child(3) > img
INFO [CrawlerEventManager] REJECTED_UNMODIFIED: http://localhost/crawl-tests/!html > body > p:nth-child(4) > img
INFO [CrawlerEventManager] REJECTED_UNMODIFIED: http://localhost/crawl-tests/!html > body > p:nth-child(5) > img
...
INFO [AbstractCrawler] crawler-test: Deleting orphan references (if any)...
INFO [AbstractCrawler] crawler-test: Deleted 0 orphan references...
Strange... if it works for you, then it must be something wrong with my config. Since I changed the image metadata extraction from DomSplitter to TikaLinkExtractor, the issue is no longer relevant for me. I'll re-open the ticket if I encounter this again. Thanks!
Hello Pascal,
I found an issue with DomSplitter, e.g.
The very first crawl works fine: the child docs get into the index. But when I start the same crawl again (no changes on the web server), the collector removes those child docs.
The desired behaviour is to delete them only when their parent is gone. Do you have any idea how to fix it? Thanks a lot!
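
To make the desired behaviour concrete, here is a rough sketch of the rule in Python. This is not Norconex code and does not reflect how the collector decides what is an orphan. The function names are hypothetical, and the `!` separator between a parent URL and its child selector is an assumption based on the references shown in the log lines in this thread.

```python
# Hypothetical sketch of the desired orphan rule; not the collector's actual logic.
# Assumption: a child reference looks like "<parent URL>!<selector>",
# as in the log lines quoted in this thread.
EMBEDDED_SEPARATOR = "!"


def parent_of(ref):
    """Return the parent reference of a child doc, or None for a top-level doc."""
    parent, sep, _ = ref.partition(EMBEDDED_SEPARATOR)
    return parent if sep else None


def orphans_to_delete(previous_refs, seen_refs):
    """Return references from the previous crawl that should be deleted.

    previous_refs: references stored by the previous crawl.
    seen_refs: references encountered in the current crawl, modified or not.
    """
    orphans = set()
    for ref in previous_refs:
        if ref in seen_refs:
            continue  # still reachable, nothing to delete
        parent = parent_of(ref)
        if parent is not None and parent in seen_refs:
            continue  # parent still exists, so keep the child
        orphans.add(ref)  # gone, or a child whose parent is gone
    return orphans
```

With this rule, a child is kept as long as its parent was seen in the current crawl, even if the child itself was not re-emitted. The sketch does not cover a child that disappears from a modified parent, and it would misread any parent URL that itself contains `!`.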