DjangoAdminHackers / django-linkcheck

An app that will analyze and report on links in any model that you register with it. Links can be bare (urls or image and file fields) or embedded in HTML (linkcheck handles the parsing). It's fairly easy to override methods of the Linkcheck object should you need to do anything more complicated (like generate URLs from slug fields etc).
BSD 3-Clause "New" or "Revised" License
75 stars 26 forks source link

Threaded signal handler and large updates #65

Closed andybak closed 6 years ago

andybak commented 7 years ago

It's already been reported that bulk actions (usually via management commands or in the shell) can cause problems as too many threads get spawned.

I've just hit a problem where this happens in a web form also and would like some feedback on the best solution.

Basically - I've got a client who has (ab)used a page template that allows an arbitrary number of inlines and added a lot more than I was expecting. Each inline has a wysiwyg field and therefore triggers linkcheck to do it's thing. I'm getting "Exception Value: can't start new thread"

Now - my first thought is to use something like multiprocessing.pool.ThreadPool to get a balance between execution time and number of threads - and to expose this as a setting.

It does feel somewhat like a band-aid. Spawning a thread for every single instance save() is going to end up causing problems at some point.

Really we should be setting a dirty flag or creating a task queue and working through all the updates in a background task.

@claudep and others - I'd appreciate your thoughts on whether it's worth doing this properly or whether the thread pool band-aid is good enough for now. (and if you've got any better ideas)

claudep commented 7 years ago

Of course, having a proper queue system would probably be the best remedy. However, setting up a queue with workers is not always trivial and would probably be overkill in this situation. I've never used process pools myself, but it looks like an appropriate fix. What makes you think it will not solve the issue?

andybak commented 7 years ago

What makes you think it will not solve the issue?

Technically speaking it only defer the issue and increase capacity. But with some sensible warnings and errors handling ("Please disable linkcheck if you're going to update 152 million rows") that might be sufficient.

andybak commented 7 years ago

@fruitschen - would you like to take an initial look at this and see what you think?

fruitschen commented 7 years ago

yes, I'll look at it.

fruitschen commented 7 years ago

Added a tasks_queue and using a single thread to process it fixes our problem of "Exception Value: can't start new thread". https://github.com/DjangoAdminHackers/django-linkcheck/commit/0d398c6380cdf26c684e49a66a652420e6a70a7f

For extremely large update, we could set a limit for the size of tasks_queue and stops adding anything to it after a certain length.