elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Filter out requests when popping from request storage #146

Closed: tanguilp closed this issue 1 year ago

tanguilp commented 3 years ago

Requests are filtered before being added to the request storage, so as to discard irrelevant pages.

When crawling large sites, some filtering rules may be added after crawling has started. This usually involves updating the filters and then updating the spider, either by restarting it or by hot code reloading (assuming we're using a persistent storage backend).

In this case, some requests saved before the rules were updated will be crawled anyway. It would be nice to find a way to filter them out when popping as well.

Ziinc commented 1 year ago

So this feature request would involve purging the request storage when performing your live crawl? I think exposing purge functionality is possible, so that one could call it manually when the spider is reloaded for example.

Let me know your thoughts.

tanguilp commented 1 year ago

I think it would rather involve filtering after popping a link to crawl.

Use case is:

  1. Store millions of links
  2. Update the rules
  3. Relaunch the crawl process and discard links that don't obey the new rules
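
As a rough illustration of step 3, here is a minimal sketch in plain Elixir of filtering at pop time. `PopFilterSketch`, `pop_allowed/2`, the in-memory list standing in for the request storage, and the example URLs are all hypothetical; none of this is Crawly's API.

```elixir
defmodule PopFilterSketch do
  # Hypothetical helper, not part of Crawly: the "storage" is a plain list
  # of URLs and each filter is a function returning true to keep a URL.

  # Pops URLs from the front of the queue until one passes every filter.
  # Returns {url, remaining_queue}, or {nil, []} when nothing is left.
  def pop_allowed([], _filters), do: {nil, []}

  def pop_allowed([url | rest], filters) do
    if Enum.all?(filters, fn filter -> filter.(url) end) do
      {url, rest}
    else
      # The stored URL violates the current rules, so drop it and keep popping.
      pop_allowed(rest, filters)
    end
  end
end

# Links stored before the rules were updated (made-up URLs).
stored = [
  "https://example.com/tag/a",
  "https://example.com/post/1",
  "https://example.com/post/2"
]

# Updated rule, applied only at pop time: skip tag pages.
filters = [fn url -> not String.contains?(url, "/tag/") end]

{next, remaining} = PopFilterSketch.pop_allowed(stored, filters)
```

Here the stale `/tag/a` link is discarded when popped, `next` is the first link that satisfies the new rule, and `remaining` holds whatever is still queued.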

But I don't know if Crawly has a persistent backend store.

That said, I don't think this is much needed for now, so I'd suggest closing the issue and reopening it if needed.