Closed: tanguilp closed this issue 1 year ago.
So this feature request would involve purging the request storage when performing your live crawl? I think exposing purge functionality is possible, so that one could call it manually when the spider is reloaded for example.
Let me know your thoughts.
I think it would rather involve filtering after popping a link to crawl.
Use case is:

Requests are filtered before being added to the request storage, so as to discard irrelevant pages.

When crawling large sites, some filtering rules may be added after crawling has started. This usually involves updating the filters and then updating the spider (by restarting it, or by hot code reloading), assuming we're using a persistent storage backend.

In that case, requests saved before the rules update will be crawled anyway. It would be nice to find a way to filter them out when popping as well.

But I don't know if Crawly has a persistent backend store.

That said, I don't think this is much needed for now, so I'd suggest closing the issue and reopening it if needed.
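The pop-time filtering described above could be sketched roughly as follows. This is a minimal illustration, not Crawly's actual API: `pop_fun` stands in for whatever pops a request from the (persistent) request storage, and `allowed?` for the spider's current filter predicate.

```elixir
defmodule PopFilter do
  # Hypothetical sketch: re-apply the current filtering rules when popping
  # a stored request, so that requests persisted before a rules update are
  # discarded instead of being fetched. All names here are illustrative.
  def pop_filtered(pop_fun, allowed?) do
    case pop_fun.() do
      # Storage exhausted: nothing left to crawl
      nil ->
        nil

      request ->
        if allowed?.(request) do
          request
        else
          # Stale request no longer matching the rules: drop it, keep popping
          pop_filtered(pop_fun, allowed?)
        end
    end
  end
end
```

The point is simply that the same predicate used at store time gets a second chance at pop time, so a rules update takes effect on already-persisted requests without purging the storage.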