Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

DomSplitter and orphans deletion #609

Closed jetnet closed 5 years ago

jetnet commented 5 years ago

Hello Pascal,

I found an issue with DOMSplitter, e.g. with this configuration:

<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter" selector="img" sourceCharset="UTF-8"/>

The very first crawl works fine and the child documents are indexed, but when I run the same crawl again (with no changes on the web server), the collector removes those child documents.

INFO  [AbstractCrawler] localhost: Deleting orphan references (if any)...
INFO  [CrawlerEventManager] DOCUMENT_COMMITTED_REMOVE: http://localhost:88/ff-test.html!#pic1
INFO  [CrawlerEventManager] DOCUMENT_COMMITTED_REMOVE: http://localhost:88/ff-test.html!html > body > img:nth-child(5)
INFO  [CrawlerEventManager] DOCUMENT_COMMITTED_REMOVE: http://localhost:88/ff-test.html!html > body > img:nth-child(8)

The desired behaviour is to delete them only when their parent is gone. Do you have any idea how to fix it? Thanks a lot!

jetnet commented 5 years ago

I forgot to mention that the orphan strategy is set to DELETE:

<orphansStrategy>DELETE</orphansStrategy>
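
For anyone trying to reproduce this, the two settings above could be combined into a minimal config along the following lines. This is only a sketch, not the reporter's actual config: the `id` values are hypothetical, the start URL is taken from the log output above, and the element layout (splitter placed under `preParseHandlers`) assumes the usual Norconex HTTP Collector 2.x structure.

```xml
<!-- Sketch only: ids are hypothetical, layout assumes HTTP Collector 2.x -->
<httpcollector id="dom-splitter-test">
  <crawlers>
    <crawler id="localhost">
      <startURLs>
        <!-- URL taken from the log lines in this thread -->
        <url>http://localhost:88/ff-test.html</url>
      </startURLs>

      <!-- References seen in a previous crawl but not in this one are deleted -->
      <orphansStrategy>DELETE</orphansStrategy>

      <importer>
        <preParseHandlers>
          <!-- Emits one child document per <img> element of the page -->
          <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
                    selector="img" sourceCharset="UTF-8"/>
        </preParseHandlers>
      </importer>
    </crawler>
  </crawlers>
</httpcollector>
```

Running a config like this twice against an unchanged page should show whether the child references (the ones containing `!`) end up in the orphan deletion step on the second run.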

essiembre commented 5 years ago

Can you share a config that reproduces the issue? I tried the split as you did, which works fine (with DELETE orphans strategy). Then I re-ran it a few times, and if the containing document stays the same (unmodified), the children also appear unmodified. Relevant snippet:

INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://localhost/crawl-tests/
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://localhost/crawl-tests/!html > body > p:nth-child(3) > img
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://localhost/crawl-tests/!html > body > p:nth-child(4) > img
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://localhost/crawl-tests/!html > body > p:nth-child(5) > img
...
INFO  [AbstractCrawler] crawler-test: Deleting orphan references (if any)...
INFO  [AbstractCrawler] crawler-test: Deleted 0 orphan references...

jetnet commented 5 years ago

Strange... If it works for you, then something must be wrong with my config. Since I switched the image metadata extraction from DOMSplitter to TikaLinkExtractor, the issue is no longer relevant to me. I'll reopen the ticket if I encounter this again. Thanks!