dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems
Other
182 stars 61 forks source link

threading backend for model training and predictions #866

Closed shaycrk closed 2 years ago

shaycrk commented 2 years ago

We've seen a warning message Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1 in triage runs and wanted to explore this further. The issue here is the nested parallelization, with model types like random forest trying to use multiple jobs already within the multiprocessing we're doing in triage.

This draft PR explicitly uses a threading backend for both training and predicting and I had a couple of observations in initial testing:

I'm opening this as a draft PR since I think it could still benefit from a little more input from people with more knowledge of python multiprocessing (looking at you @thcrock, @nanounanue, and @jesteria) and perhaps a little more testing with real projects beyond the small tests I did with some synthetic data.

From what I've read, I wonder if Dask might be a better longer-term solution, though I'd wonder if we would want to make use of it for both the triage and sklearn layers of parallelization if so. Likewise, I wonder about whether we might want to eventually decouple training, predicting, and running evaluations (though the trade-off here might be lots of I/O between those tasks). In any case, perhaps using the threading backend as I'm doing here would be reasonable for removing the warning on predictions in the shorter term?

Also, for reference, here are a few of the things I was reading in looking into this:

(as it happens, we also ran into this issue with the recent CfA project -- Issue 15 in that repo)

jesteria commented 2 years ago

Briefly: The concern with Python's Global Interpreter Lock (GIL) is that threading does not permit the use of multiple CPUs. Threads otherwise perform properly; but, this limitation makes them most suitable to I/O-bound workloads, rather than CPU-bound, (as you gain no additional processing power by threading purely CPU-bound work, and if anything might lose performance due to the overhead of managing threads).

I don't know if that explains your experience with threading (above). Certainly, not every workload benefits from parallelization; but, that could be the case. (That said, well-written Python libraries that use threading over multiprocessing do so with the above in mind, and for good reason … at least in theory.)

It is possible to work around the GIL in threading and there are libraries that do so; though, this is a fancy enough fix/hack that in general it's made quite clear that they're doing so. Greenlets come to mind. (That is, another possibility with a threaded library that operates without issue is simply that it is threading, as normal for Python, i.e. bound to a single CPU. Whether that is useful to your workload is another question.)

As for loky: it describes itself as built around Python's ProcessPoolExecutori.e. it uses multiple processes, not threads. If I had to guess, the problem arises because multiprocessing by default uses daemonic child processes, which cannot spawn their own child processes. (So it could be worked around and without anything so drastic as hacking Python itself. But nonetheless it could get a little icky.) Switching to a threaded top-level pool works because it's the main process which is forking child processes (via loky); the use of threading in the main process doesn't have any bearing (so long as it's the main thread that does the forking anyway).

As for what would benefit Triage, I can't really say. IIRC Triage uses a multiprocessing backend for good reason.

All I can say are the following generalities, (which I feel have been substantiated by modeling frameworks I've worked on since, though admittedly none as substantial as Triage):

Indeed I think Triage already does this fairly well, and I'm sure I'm preaching to the choir. But I raise it because I would guess that Dask or anything like it would be very useful, but likely not for all applications of Triage. The less work that can be done to support this sort of logical forking the better. IIRC Triage already takes a backend/plugin approach to multiprocessing; it could with Dask as well.

Of course there is an overhead to decoupling, etc. A degree of logical overhead is unavoidable. There needn't however be a significant performance overhead – the decoupled interface may indicate a path on disk or in S3, and/or a reference to an object in memory. Rather, that something like Dask might require additional I/O is a reason that users might not like to default to this backend.

That wasn't so brief. But you raised some larger questions! To be actually brief: maybe you just need to ensure that loky is only invoked from the parent/main process…. (Or, if it must be invoked by child processes, then these must not be daemonic, and you have to ensure they're killed properly, etc.) As for the rest… 🤷‍♂️

On Thu, Sep 16, 2021, 1:19 PM Kit Rodolfa @.***> wrote:

We've seen a warning message Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1 in triage runs and wanted to explore this further. The issue here is the nested parallelization, with model types like random forest trying to use multiple jobs already within the multiprocessing we're doing in triage.

This draft PR explicitly uses a threading backend for both training and predicting and I had a couple of observations in initial testing:

  • I had seen comments online about using threading potentially not actually yielding parallelization because of python's Global Interpreter Lock but I don't seem to encounter those issues here, so it seems perhaps RandomForests are releasing the lock in a way that allows threading to work.
  • These changes do indeed get rid of the Loky-backed parallel loops warning
  • However, it looks like this warning is arising on prediction, while model training seems to be using a threading backend regardless (at least in my tests) and manages to make use of all cores anyway
  • I wasn't seeing any speed improvements in my testing here (presumably because running the predictions are fairly fast anyway?)

I'm opening this as a draft PR since I think it could still benefit from a little more input from people with more knowledge of python multiprocessing (looking at you @thcrock https://github.com/thcrock, @nanounanue https://github.com/nanounanue, and @jesteria https://github.com/jesteria) and perhaps a little more testing with real projects beyond the small tests I did with some synthetic data.

From what I've read, I wonder if Dask might be a better longer-term solution, though I'd wonder if we would want to make use of it for both the triage and sklearn layers of parallelization if so. Likewise, I wonder about whether we might want to eventually decouple training, predicting, and running evaluations (though the trade-off here might be lots of I/O between those tasks). In any case, perhaps using the threading backend as I'm doing here would be reasonable for removing the warning on predictions in the shorter term?

Also, for reference, here are a few of the things I was reading in looking into this:

- https://scikit-learn.org/stable/modules/generated/sklearn.utils.parallel_backend.html

(as it happens, we also ran into this issue with the recent CfA project -- Issue 15 in that repo)

You can view, comment on, or merge this pull request online at:

https://github.com/dssg/triage/pull/866 Commit Summary

  • threading backend for model training and predictions

File Changes

Patch Links:

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dssg/triage/pull/866, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEBUNU6WMZTJIZKUJIZLBTUCIYMDANCNFSM5EFKZNGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

nanounanue commented 2 years ago

I didn't see this warning message on my tests running a previous version of triage (i.e. 4.x). Is this something new due to Python or Sklearn?

I wonder why wasn't there before ... I am gonna try again on my local machine and I will report back if shows up ...

shaycrk commented 2 years ago

@jesteria -- thanks for the very detailed notes! I think I need to give them one more read to make sure I follow 😄

@nanounanue -- which version of x for 4.x? We've definitely seen this warning prior to the v5.0 updates. Also note that I've only seen it when generating predictions, not when training, so make sure you're using save_predictions=True in your testing.

nanounanue commented 2 years ago

Kit, I will do that, I have several versions installed, and I will run them.

On Fri, Sep 17, 2021, at 16:53, Kit Rodolfa wrote:

@jesteria https://github.com/jesteria -- thanks for the very detailed notes! I think I need to give them one more read to make sure I follow 😄

@nanounanue https://github.com/nanounanue -- which version of x for 4.x? We've definitely seen this warning prior to the v5.0 updates. Also note that I've only seen it when generating predictions, not when training, so make sure you're using save_predictions=True in your testing.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dssg/triage/pull/866#issuecomment-922101861, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADYXQGFVTUPZYQLO65JHP3UCO2HJANCNFSM5EFKZNGA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

thcrock commented 2 years ago

I don't see anything wrong with this; it seems that if the initial message was associated with prediction as opposed to training, I wouldn't expect that fixing the warning would necessarily give you a speedup.

I'm on board with Dask as well for the future; at least on paper it seems like it would allow us to break out of SQL-land while not requiring the user to run a Spark cluster or something. The packages like https://blazingsql.com/ and https://dask-sql.readthedocs.io/en/latest/ might be utilized to allow cohort/label/feature generation to be configured similarly as they are now. And it would allow an easier path for power users to, say, write their own raw python feature generation.

shaycrk commented 2 years ago

Going ahead and merging this for now to eliminate the warning, but we should revisit multiprocessing in triage more broadly (maybe dask or maybe something else) at some point as well.