Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

Question - confirm that I can force recrawl by deleting reference #22

Closed. danizen closed this issue 6 years ago.

danizen commented 6 years ago

Jupyter notebooks can be very dangerous:

from elasticsearch_dsl import Search  # assumes an existing `es` Elasticsearch client
for idlist in idlistcandrop:
    # Mistake: 'terms' matches on the url field, not the document id
    Search(using=es).query('terms', url=idlist).delete()

That should have been a query on id, because I'm using hashes of URLs now. What I'm wondering is whether deleting their references from the crawl store is enough to force the crawler to find these documents again on its next run.

Thanks

essiembre commented 6 years ago

Can you elaborate? Do you want deletions in the crawl store to result in deletion requests to your committer? If so, I would add a reference filter and configure the crawler so orphans are deleted.
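For reference, a minimal sketch of what that could look like in a Norconex HTTP Collector 2.x crawler configuration; the crawler id and the regex pattern are hypothetical placeholders:

<crawler id="my-crawler">
  <!-- References excluded by a filter become orphans on the next run -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="exclude">https://example\.com/obsolete/.*</filter>
  </referenceFilters>
  <!-- DELETE sends deletion requests for orphaned references to the committer -->
  <orphansStrategy>DELETE</orphansStrategy>
</crawler>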

danizen commented 6 years ago

No, I accidentally deleted the documents in Elasticsearch while attempting to "clean up" some data. I want those documents to be recrawled. I think I can simply remove them from the "references" collection before re-running the crawler. The "references" are then copied to the "cached" collection, but when a new reference is discovered, there will be no hit in "cached". I am simply checking that this is OK, and it is even a point in favor of using MongoDB, which has a good command line (vs. MVStore).
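If it helps, a minimal pymongo sketch of that cleanup, using the "references" collection mentioned above; the connection string, database name, and the "reference" field name are assumptions to verify against your own crawl store:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["my_crawler"]  # hypothetical crawl-store database name
urls_to_recrawl = ["https://example.com/page1"]  # hypothetical references
# Remove the entries so the next run treats these references as new
db["references"].delete_many({"reference": {"$in": urls_to_recrawl}})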

essiembre commented 6 years ago

I see. Yes, references deleted from the crawl store will be recrawled, because the crawler will think they are "new". As an alternative, if you want to keep any other cached information about those references, you can wipe out only the checksum fields. That will force every recrawled document to be considered "modified", and they will be sent to your committer again.
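A rough pymongo sketch of that alternative, under the same assumptions as above; the contentChecksum and metaChecksum field names are my guesses at the crawl-store schema and should be verified against your collection first:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client["my_crawler"]  # hypothetical crawl-store database name
# Blank both checksums so every recrawled document compares as "modified"
db["references"].update_many(
    {}, {"$unset": {"contentChecksum": "", "metaChecksum": ""}})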

Make sense?

danizen commented 6 years ago

Thanks - both make sense.