PuerkitoBio / gocrawl

Polite, slim and concurrent web crawler.
BSD 3-Clause "New" or "Revised" License

Graceful & Proper Termination, both internally and via Extender #73

Open · Kleissner opened this issue 4 years ago

Kleissner commented 4 years ago

The current stop functionality via Crawler.Stop is insufficient for multiple reasons:

1. Calling Stop twice results in a panic (because it would close c.stop twice).
2. Inside the functions Extender.Visit and Extender.Error there is currently no way of terminating the crawler, even though that is often needed, for example when some limit is reached.
3. In Crawler.collectUrls, when res.idleDeath is true, workers may be deleted; however, if 0 workers remain, the crawler as a whole does not terminate, leaving it in limbo (a memory leak). I consider this a bug.

I'll create a pull request with a fix. Since it affects the Extender functions, it will break compatibility, but that's a good thing here.

Kleissner commented 4 years ago

I've committed the code into my fork: https://github.com/IntelligenceX/gocrawl/commit/089a8adb75e61f99b632192d0956aa0cb9fb1c0b

I will create the PR once the other two PRs are handled.

mna commented 4 years ago

Hello Peter,

> Calling Stop twice results in a panic (because it would close c.stop twice)

It doesn't panic (outside the boundaries of the package); it calls the ErrorLog if Stop is called more than once.
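As an aside, the standard Go idiom for making a stop channel safe against repeated closes is to guard the close with sync.Once. The sketch below is a generic illustration of that pattern, not gocrawl's actual internals; the `stopper` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper guards a stop channel with sync.Once so that Stop can be
// called any number of times without panicking: only the first call
// actually closes the channel.
type stopper struct {
	once sync.Once
	ch   chan struct{}
}

func newStopper() *stopper {
	return &stopper{ch: make(chan struct{})}
}

// Stop closes the channel exactly once; later calls are no-ops.
func (s *stopper) Stop() {
	s.once.Do(func() { close(s.ch) })
}

// Stopped reports whether Stop has been called.
func (s *stopper) Stopped() bool {
	select {
	case <-s.ch:
		return true
	default:
		return false
	}
}

func main() {
	s := newStopper()
	s.Stop()
	s.Stop() // second call would panic with a bare close(s.ch)
	fmt.Println("stopped:", s.Stopped())
}
```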

> Inside the functions Extender.Visit and Extender.Error there is right now no way of terminating the crawler. Often though, there is absolutely the need to terminate the crawler, for example if some limits are reached.

I agree that would be useful.

> In Crawler.collectUrls when res.idleDeath is true, it may delete workers. However, if there are 0 workers it does NOT terminate the entire crawler, leaving it in limbo (memory leak). This I consider a bug.

I'm not sure I understand that part, do you have a minimal reproducible example of that? I'd like to get it fixed in the v1 tree if indeed there is a leak/bug there.

As for the changes, I took a look at your fork, but if we're going to break compatibility in a v2 to better support termination, I'd prefer to use the context.Context approach that is now standard in Go instead of passing around channels to close and cancellation functions to call. I think something like calling Crawler.Run(ctx, seeds) and having that context propagate through all the extender methods (where that makes sense) would go a long way toward making this cleaner and more idiomatic.

That only addresses responding to cancellation, though, not initiating it (your second point). That's something the proposed changes should take into account as well.

Note that I haven't looked at / used gocrawl in a very long while, so those are just general comments and I'm not sure how well that could be integrated. Happy to bounce ideas about this with you and anyone who cares about this, though!

Thanks, Martin