avi7777 opened this issue 6 years ago
When changing maxDepth from 1 to zero, you will end up with many "orphans". I would check the orphan strategy you supplied. By default, it is set to PROCESS, which will attempt to recrawl orphans. If you want orphans deleted instead, set it to DELETE (in your collector config).
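For reference, here is a minimal sketch of where that setting lives in an HTTP Collector crawler configuration. The ids and the sitemap URL are placeholders, and the surrounding elements are trimmed down to just the parts discussed in this thread, so this is not a complete working config:

```xml
<httpcollector id="my-collector">
  <crawlers>
    <crawler id="my-crawler">
      <startURLs>
        <!-- Placeholder sitemap URL; replace with your own. -->
        <sitemap>https://example.com/sitemap.xml</sitemap>
      </startURLs>
      <maxDepth>0</maxDepth>
      <!-- PROCESS is the default; DELETE sends orphans to the committer as deletion requests. -->
      <orphansStrategy>DELETE</orphansStrategy>
    </crawler>
  </crawlers>
</httpcollector>
```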
Hi,
Thanks for the quick reply.
I have set the orphan strategy to DELETE, but the documents are still not getting deleted from the index.
<orphansStrategy>DELETE</orphansStrategy>
I have observed that if I point the start URL in the same collector config to a different sitemap XML file (removing the old sitemap URL that was already indexed), then the existing documents do get deleted from the index and it is freshly updated with the documents crawled from the new sitemap URL.
Is it possible your sitemap lists all URLs so setting the maxDepth to zero or 1 changes very little? Can you provide exact steps to reproduce (actual sitemap URL and a sample URL that should be deleted)?
Hi,
You can follow these steps to reproduce:
1) Have a sitemap.xml with 400 URLs in it.
2) Crawl the sitemap.xml for the first time, keeping
Hi, how does Norconex get to know which orphan URLs are present in the index? Does the collector compare against the crawlstore directory from the previous execution and delete the orphans using the crawlstore directory as a reference? In my case I am deleting the working directory and recrawling with the collector, and my expectation is that it should automatically overwrite the Azure index. Please let me know how deletion works.
Yes, the crawlstore is used for finding orphans. Every document reference is stored with a checksum. On the next run, all previously stored references are used as a "cache". Every document that is processed again is removed from that cache. Those that remain in the end are the "orphans".
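To illustrate, here is a conceptual sketch of that cache-diff logic only; it is not the collector's actual code, and the URLs are made up:

```java
import java.util.HashSet;
import java.util.Set;

// Conceptual sketch of orphan detection: references from the previous run
// act as a "cache"; everything re-processed in the current run is removed
// from it, and whatever remains is treated as an orphan.
public class OrphanDetectionSketch {
    public static void main(String[] args) {
        // References stored in the crawlstore by the previous run.
        Set<String> previousRun = new HashSet<>(Set.of(
                "https://example.com/page1",
                "https://example.com/page2",
                "https://example.com/old-page"));

        // References processed during the current run.
        Set<String> currentRun = Set.of(
                "https://example.com/page1",
                "https://example.com/page2");

        // Remove everything re-processed; what is left are the orphans.
        Set<String> orphans = new HashSet<>(previousRun);
        orphans.removeAll(currentRun);

        // With orphansStrategy DELETE, these would be sent to the committer
        // (e.g. your Azure Search index) as deletion requests.
        System.out.println("Orphans: " + orphans); // [https://example.com/old-page]
    }
}
```

This is also why deleting the crawlstore breaks deletion: without the previous run's reference set, there is nothing left to diff against.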
If you delete the workdir (more specifically the crawlstore), then the previous crawl history is lost and it is as if you are crawling fresh. It will overwrite what is already in Azure under the same unique id, but it won't know what should be deleted. If you really want to start clean, the best approach is to wipe out your collection when you wipe out the workdir (or use a new one), or handle deletion manually in Azure somehow (based on a timestamp you may have, or something similar).
Hi, initially I ran the job with
<maxDepth>1</maxDepth>
, which committed several documents that were not necessary. So later, I changed the maxDepth value from 1 to zero. With this change the crawler crawled only the URLs that were provided, and the job finished earlier than when it ran with maxDepth 1, but the unnecessary documents that had already been committed during the maxDepth 1 run did not get deleted from the Azure index on the second run with maxDepth zero. Workaround tried: deleted the index and re-ran the job. This fixed the issue.
Could you help me resolve this issue without having to delete the existing index and re-run the crawler?