Bugfix for Manager timing out on high initial url/request count

elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

https://hexdocs.pm/crawly

Apache License 2.0


Bugfix for Manager timing out on high initial url/request count #151

Closed: Ziinc closed this 3 years ago

Ziinc commented 3 years ago

This is a bugfix for a crash in the Manager: its init callback times out, especially when a spider has a high number of start requests/urls.

This PR implements a split strategy for storing urls/requests that combines sync and async methods: the first 1000 requests are stored synchronously, and a linked task is then fired off to add the remaining requests asynchronously.
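A minimal sketch of the split strategy described above, not the actual diff from this PR. The module name `SplitStoreSketch`, the Agent-backed store, and the `store_split/3` helper are hypothetical stand-ins for illustration; the real change would write to Crawly's request storage instead.

```elixir
defmodule SplitStoreSketch do
  # Hypothetical in-memory store (an Agent holding a list), standing in
  # for the real request storage.
  def start_store, do: Agent.start_link(fn -> [] end)

  def store(store, requests) when is_list(requests) do
    Agent.update(store, fn acc -> acc ++ requests end)
  end

  def count(store), do: Agent.get(store, &length/1)

  # Stores the first `sync_limit` requests before returning, then hands
  # the remainder to a linked task so the caller is not blocked on it.
  def store_split(store, requests, sync_limit \\ 1000) do
    {sync_batch, async_batch} = Enum.split(requests, sync_limit)
    :ok = store(store, sync_batch)
    {:ok, task_pid} = Task.start_link(fn -> store(store, async_batch) end)
    {:ok, task_pid}
  end
end
```

Because the task is linked, a failure while inserting the remaining requests also takes down the calling process instead of silently dropping requests.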

Ziinc commented 3 years ago

OK. For (1), I will refactor to use handle_continue. For (3), since the start urls/requests are one-off insertions, I think placing them in the manager is fine. We could add async bulk request insertion to the RequestStorage, if that is what you mean.

Will open up separate issues for (2) once this is merged.
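For reference, a rough sketch of what the handle_continue approach for (1) could look like. This is an illustration under assumptions, not the final refactor: the module name `ContinueInitSketch` and the in-state `stored` list are hypothetical, and the real Manager would insert into the request storage instead.

```elixir
defmodule ContinueInitSketch do
  use GenServer

  def start_link(requests), do: GenServer.start_link(__MODULE__, requests)

  def count(pid), do: GenServer.call(pid, :count)

  @impl true
  def init(requests) do
    # Return right away so start_link does not time out; the heavy
    # insertion is deferred to handle_continue/2, which runs before any
    # other message is processed.
    {:ok, %{stored: []}, {:continue, {:store, requests}}}
  end

  @impl true
  def handle_continue({:store, requests}, state) do
    # Hypothetical stand-in for the real insertion into request storage.
    {:noreply, %{state | stored: requests}}
  end

  @impl true
  def handle_call(:count, _from, state) do
    {:reply, length(state.stored), state}
  end
end
```

Since the continue step is handled before anything else in the mailbox, a call made immediately after start_link already sees the stored requests.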