Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Orphans not being deleted #548

Closed: Hakky54 closed this issue 4 years ago

Hakky54 commented 5 years ago

Hi,

I have a situation where orphans are never deleted from my Elasticsearch index when a page is no longer found but is still referenced by one of the URLs. To reproduce the situation I hosted three test pages on my local environment with nginx.

I instructed the crawler to crawl the following URLs:

<startURLs>
    <url>http://localhost:8080/test/1</url>
    <url>http://localhost:8080/test/3</url>
</startURLs>

The page http://localhost:8080/test/2/ is picked up because it is linked from the HTML page at http://localhost:8080/test/1/.
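For illustration, the page served at /test/1 is essentially a plain HTML page containing an anchor to /test/2, roughly like this (a simplified sketch, not the exact file):

<html>
  <body>
    <!-- this hyperlink is what causes /test/2/ to be discovered by the crawler -->
    <a href="http://localhost:8080/test/2/">Test page 2</a>
  </body>
</html>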

I started the crawler; it successfully found the three documents and pushed them to Elasticsearch. After that I disabled http://localhost:8080/test/2/, so the crawler gets a 404 when it tries to crawl that URL. I would expect the committer to send a delete request to Elasticsearch, but that doesn't happen.

I already added the following property to the crawler's XML configuration:

<orphansStrategy>DELETE</orphansStrategy>

I also tried adding the configuration below, but it didn't work.

<spoiledReferenceStrategizer class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer">
    <mapping state="NOT_FOUND" strategy="DELETE" />
    <mapping state="BAD_STATUS" strategy="DELETE" />
    <mapping state="ERROR" strategy="IGNORE" />
</spoiledReferenceStrategizer>

I debugged the code and it didn't create a *-del.ref file in the committer queue.

The crawlstore had the following content for the mapProcessedInvalid method:

[Screenshot: crawlstore contents, 2018-12-17 13:51]

Any idea if I am missing some configuration or property?

essiembre commented 5 years ago

I do not see anything wrong with your approach. It should send a deletion request.
Have you cleared the crawlstore between runs? If so, that would explain why no deletion requests were sent.

If not, can you attach your log file and share your full config so I can attempt to reproduce the issue?

Hakky54 commented 5 years ago

Unfortunately, I was deleting the crawlstore before each new run. I disabled that and tried again. While debugging I could see that the orphan was being added to the queue of the crawlerDataStore in the deleteCacheOrphans method of AbstractCrawler. The logs say it deleted the orphan, but it is not actually removed from Elasticsearch.

I attached the logs, the configuration, and even the test pages. I replaced the client-specific package names with "mycompany" to prevent their name from being public here...

crawler.zip

essiembre commented 5 years ago

Your committer seems to be custom, so if the issue is in there, I can't tell. Do you get the same behavior with the Norconex Elasticsearch Committer?

I see in your committer config that you have "urlHash" defined as the "sourceReferenceField". Where is that value coming from? The URL sent for deletion is the original URL, after it went through URL normalization. So if the document was added under "urlHash" and the deletion uses "document.reference" (i.e., the URL), that may very well be why it fails to delete (there is no match).
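For comparison, a minimal configuration of the Norconex Elasticsearch Committer with that option would look something like the sketch below (node, index, and type values are placeholders, and exact option names depend on the committer version; the sourceReferenceField line is the part under discussion):

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://localhost:9200</nodes>
  <indexName>mycrawl</indexName>
  <typeName>doc</typeName>
  <!-- Uses the value of this custom field as the document ID when adding.
       Deletions are keyed on the normalized document.reference (the URL),
       so documents indexed under urlHash will not be matched. -->
  <sourceReferenceField keep="false">urlHash</sourceReferenceField>
</committer>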

Hakky54 commented 5 years ago

The urlHash is something we create on the fly for each document; it is a Base64-encoded string of the URL. We set it back in the metadata and store it together with the document in Elasticsearch. I disabled the custom sourceReferenceField and that did the trick: orphans are now deleted correctly. Thank you for your explanation! :)

Is it also possible to make it work with a custom sourceReferenceField? Would that be a new feature request, or is a custom one not recommended?

essiembre commented 5 years ago

You mean making the deletion work with a sourceReferenceField? If so, it is a bit tricky, because the sourceReferenceField can be anything. For example, it could be a value extracted during importing, a UUID, or whatever else. So the relationship between the original URL and the final "reference" field may not always be the same. When a document is no longer found, it cannot be re-parsed, so you do not get that final reference field to fire the right deletion. URL normalization is applied again, but beyond that, you are out of luck.

Still, we could cover many use cases by making that association explicit in the Collector configuration and caching it so that, when a deletion occurs, we send the right thing to the Committer. I will mark this as a feature request.

Hakky54 commented 5 years ago

Yes, making the deletion work with sourceReferenceField. I think that would only be possible if the property could be set after the urlNormalizer; currently it can only be set at the committer level.

So my suggestion would be something like this:

<crawler id="test">
    <startURLs>
        <url>http://localhost:8080/test/1</url>
        <url>http://localhost:8080/test/3</url>
    </startURLs>
    <maxDepth>2</maxDepth>
    <userAgent>Mozilla/5.0 (compatible)</userAgent>
    <referenceFilters>
    </referenceFilters>
    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
            addDomainTrailingSlash,encodeSpaces,lowerCaseSchemeHost,removeDefaultPort,
            removeDotSegments,removeDuplicateSlashes,removeEmptyParameters,
            removeFragment,removeSessionIds,removeTrailingQuestionMark
        </normalizations>
    </urlNormalizer>
    <!-- this line below -->
    <sourceReferenceField class="com.norconex.collector.http.url.impl.SourceReferenceFieldSetter">urlHash</sourceReferenceField>
</crawler>

And do you have any estimate of when you could pick up this feature request and implement it, so the world can use this amazing feature?