elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

[Feat] Support auto close spider when all requests finished #202

Closed EdmondFrank closed 1 year ago

EdmondFrank commented 2 years ago

It seems that even when all the workers' request lists are empty, the spider still cannot stop automatically.

Although closespider_timeout covers some scenarios, it introduces a new problem: the spider may close early when the network environment is poor.
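For context, here is a minimal sketch of the relevant configuration; `closespider_timeout`, `closespider_itemcount`, and `concurrent_requests_per_domain` are real Crawly options, but the values are purely illustrative:

```elixir
# config/config.exs -- illustrative values, not project defaults
import Config

config :crawly,
  # Throttle to one in-flight request per domain
  concurrent_requests_per_domain: 1,
  # Close the spider when fewer items than this were scraped within the
  # check interval; this is the knob discussed in this thread
  closespider_timeout: 1,
  # Hard cap on the number of scraped items
  closespider_itemcount: 500
```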

oltarasenko commented 2 years ago

Hey @EdmondFrank .

It's not quite clear if this approach is going to solve the issue. But still, what is the problem with having just the closespider_timeout?

EdmondFrank commented 2 years ago

> Hey @EdmondFrank.
>
> It's not quite clear if this approach is going to solve the issue. But still, what is the problem with having just the closespider_timeout?

In the process of using Crawly, I have encountered two problems.

First, I need to develop some slow crawlers, with a request frequency of roughly one request every 60-90 seconds.

Second, some of the websites I crawl are not very stable, sometimes refusing service for a few minutes before returning to normal.

In both scenarios, closespider_timeout sometimes sees 0 items/min even though not all requests have been crawled yet, so the spider is closed prematurely.
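Until automatic closing lands, one possible workaround (a sketch, not Crawly's built-in behavior) is to leave closespider_timeout disabled and stop the spider manually once an application-specific completion check passes. `Crawly.Engine.start_spider/1` and `Crawly.Engine.stop_spider/1` are part of Crawly's public API; the `crawl_done?/1` helper below is hypothetical and must be supplied by the application:

```elixir
defmodule SlowCrawlMonitor do
  @moduledoc """
  Hypothetical monitor process: starts a spider, then polls until an
  application-specific completion check passes and stops it by hand.
  crawl_done?/1 is a placeholder (e.g. compare stored items against a
  known total of expected requests).
  """

  def run(spider) do
    Crawly.Engine.start_spider(spider)
    poll(spider)
  end

  defp poll(spider) do
    # Check once a minute; adjust to the crawl's pace.
    Process.sleep(60_000)

    if crawl_done?(spider) do
      Crawly.Engine.stop_spider(spider)
    else
      poll(spider)
    end
  end

  # Placeholder completion check; always false in this sketch.
  defp crawl_done?(_spider), do: false
end

# Usage: Task.start(fn -> SlowCrawlMonitor.run(MySpider) end)
```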

revati commented 1 year ago

This was merged into master on September 14th, the same day Crawly 0.14 was released. But it seems this feature is not part of the 0.14 release. Was that intentional?