Closed — danizen closed this issue 6 years ago
Can you elaborate? Do you want deletions in the crawlstore to result in deletion requests to your committer? If so, I would add a reference filter and configure it so orphans are deleted.
No, I accidentally deleted the documents in Elasticsearch while attempting to "clean up" some data, and I want those documents to be recrawled. I think I can simply remove them from the "references" collection before re-running the crawler. The "references" are then copied to the "cached" collection, but when a reference is rediscovered, there will be no hit in "cached". I am simply checking that this is OK, and even an advantage of using MongoDB, which has a good command-line client (vs. MVStore).
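For what it's worth, that cleanup might look like the following in the mongo shell. This is only a sketch: the database name (`norconex`), the field name (`reference`), and the URL pattern are assumptions not confirmed in this thread; only the `references` and `cached` collection names come from the discussion above.

```javascript
// Hypothetical mongosh session; adjust db/field names to your deployment.
use norconex;

// Remove the accidentally deleted documents' entries so the crawler
// treats them as new on the next run.
db.references.deleteMany({ reference: /example\.com\/docs/ });

// Optionally clear the cached copies too, so rediscovered references
// find no hit in "cached".
db.cached.deleteMany({ reference: /example\.com\/docs/ });
```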
I see. Yes, references deleted from the crawlstore will be recrawled, because the crawler will think they are "new". As an alternative, if you want to keep the other cached information about those references, you can look at wiping out only the checksum fields. That will force every recrawled document to be considered "modified", and they will be sent to your committer again.
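The checksum-wiping alternative could be done like this in the mongo shell. Again a sketch only: the exact field names (`contentChecksum`, `metaChecksum` here) are assumptions and may differ by collector version, so verify them against your own collection first.

```javascript
// Hypothetical: inspect one document to confirm the actual field names.
db.cached.findOne();

// Unset the checksum fields so every recrawled document is considered
// "modified" and re-sent to the committer, while keeping the rest of
// the cached information intact.
db.cached.updateMany(
  {},
  { $unset: { contentChecksum: "", metaChecksum: "" } }
);
```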
Make sense?
Thanks - both make sense.
Jupyter notebooks can be very dangerous:
That should have been a query on id, because I'm using hashes of URLs now. What I'm wondering is whether that is enough to force the crawler, on its next run, to find these and delete their references from the crawlstore.
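Since I'm querying by id, I derive the id from the URL. A minimal sketch of that mapping, assuming SHA-256 hex digests (the actual hash function used is not stated in this thread, so treat it as illustrative):

```python
import hashlib


def url_to_id(url: str) -> str:
    """Hypothetical helper: derive a stable document id from a URL,
    e.g. for use as an Elasticsearch _id."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Deterministic: the same URL always yields the same id,
# so deletes and re-indexes target the same document.
doc_id = url_to_id("https://example.com/page")
print(doc_id)
```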
Thanks