Norconex / committer-solr

Solr implementation of Norconex Committer. Should also work with any Solr-based products, such as LucidWorks.
https://opensource.norconex.com/committers/solr/
Apache License 2.0

Full recrawl and reindex in Solr #18

Open OkkeKlein opened 4 years ago

OkkeKlein commented 4 years ago

How would one go about doing a full recrawl of content (filesystem crawler) and then doing a (hard) commit only after all content has been indexed?

Basically, I want to do a fresh update on a live system.

essiembre commented 4 years ago

To perform a "clean" crawl (one that reprocesses everything instead of sending only modifications and deletions), you simply have to delete your "workdir" or, more precisely, the crawlstore inside it.

To not commit until you are done, you can set the "solrCommitDisabled" option to "true" in your Solr committer section. This means the committer will never send a Solr "commit" request. Commits then depend on your Solr configuration (e.g., its auto-commit settings) or on a commit you issue manually.
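
For the manual commit, here is a minimal Python sketch of what the request could look like once the crawl has finished. It uses Solr's JSON update format, where a `commit` command is posted to the collection's `/update` handler. The base URL and the collection name `mycollection` are placeholders I made up; adjust them to your setup.

```python
import json
from urllib.request import Request, urlopen


def build_commit_request(solr_base_url, collection):
    """Return (url, body) for a hard commit sent to Solr's update handler."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/update"
    # In Solr's JSON update format, {"commit": {}} requests a hard commit.
    body = json.dumps({"commit": {}}).encode("utf-8")
    return url, body


def send_commit(solr_base_url, collection, timeout=30):
    """Post the commit; call this only after the crawl has completed."""
    url, body = build_commit_request(solr_base_url, collection)
    request = Request(url, data=body,
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical usage (placeholder URL and collection name):
# send_commit("http://localhost:8983/solr", "mycollection")
```

Since the committer no longer commits, nothing becomes visible to searches until this request (or Solr's own auto-commit, if you have it enabled) runs.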

OkkeKlein commented 4 years ago

Thank you!

OkkeKlein commented 4 years ago

How do I deal with deleted docs? A manual delete of everything? Or maybe a parameter sent with the Solr committer?

essiembre commented 4 years ago

A simple approach would be to add two fields to your collection: one that identifies the source crawler and one that identifies the crawl date. In your config, you can use the ConstantTagger to populate the first, and the CurrentDateTagger to populate the second.

With this, you can use the "delete by query" approach on Solr. You would issue a query that deletes anything older than the date of your full recrawl, for the given crawler.
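
As a rough illustration, the delete-by-query body could be built like this in Python. The field names `crawler` and `crawl_date` and the crawler name are hypothetical; use whatever you configured in the two taggers. The sketch also assumes the crawl date is stored in a Solr date field.

```python
import json
from datetime import datetime, timezone


def build_delete_request(crawler_field, crawler_name, date_field, crawl_start):
    """Return the JSON body for a Solr delete-by-query request.

    crawl_start must be a timezone-aware datetime marking when the
    full recrawl began.
    """
    # Solr date fields expect UTC timestamps ending in "Z".
    cutoff = crawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # "[* TO cutoff}" matches every date strictly before the cutoff,
    # so documents indexed by the new crawl are left alone.
    query = f'{crawler_field}:"{crawler_name}" AND {date_field}:[* TO {cutoff}}}'
    return json.dumps({"delete": {"query": query}})


# Hypothetical field and crawler names:
start = datetime(2020, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print(build_delete_request("crawler", "fs-crawler", "crawl_date", start))
```

You would post that body to your collection's `/update` handler with a JSON content type after the recrawl completes, then commit. Record the start time before launching the crawl so the cutoff is known when you issue the delete.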