_partial.fit modifies estimator inplace

dask / dask-ml

Scalable Machine Learning with Dask

BSD 3-Clause "New" or "Revised" License

Questions from @mrocklin in https://github.com/dask/dask-ml/pull/275#issuecomment-402269422

Shouldn't we be cloning the model here before calling partial fit? Otherwise we're mutating the input. What if we have to rerun this task because the worker that the result was on failed?
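To make the concern concrete, here is a minimal sketch of why mutating the input matters when a task is rerun. `ToyEstimator`, `fit_inplace`, and `fit_copy` are hypothetical names made up for illustration, not dask-ml code, and `copy.deepcopy` stands in for whatever copying strategy is actually chosen. Only the standard library is used.

```python
import copy


class ToyEstimator:
    """Hypothetical stand-in for an estimator with partial_fit (running sum)."""

    def __init__(self):
        self.n_seen = 0
        self.total = 0.0

    def partial_fit(self, block):
        # Incremental update: mutates self and returns self.
        self.n_seen += len(block)
        self.total += sum(block)
        return self


def fit_inplace(model, block):
    # Mutates the input model.
    return model.partial_fit(block)


def fit_copy(model, block):
    # Copies first, so the input is untouched and the task can be rerun safely.
    return copy.deepcopy(model).partial_fit(block)


block = [1.0, 2.0, 3.0]

# In place: running the same task twice on the same input double-counts the block.
model = ToyEstimator()
fit_inplace(model, block)
retried = fit_inplace(model, block)
inplace_n_seen = retried.n_seen

# Copying: the retry sees the original input and gives the same answer as the first run.
model = ToyEstimator()
fit_copy(model, block)
retried = fit_copy(model, block)
copy_n_seen = retried.n_seen
input_n_seen = model.n_seen
```

With the in-place version the rerun produces a different model than the first run did; with the copying version the task is repeatable and the input model stays unfitted.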

I'm not sure what happens when the worker fails :)

Let's say that:

Worker A completed partial_fit on the first block of data.
Worker B fails on the second block of data.

IIRC, when a worker fails during computation, the scheduler will mark the task as suspicious and reschedule it on another worker. Let's say it's rescheduled on worker C for whatever reason.

Worker C asks worker A for the result of fit-<token>-0 and then runs partial_fit on the second block. I think everything is OK. The scheduler should always have a correct understanding of which worker holds the result of the latest successful fit call.

Does that sound right? Am I missing scenarios where we do something wrong?
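A rough simulation of the scenario, under two assumptions that are mine and worth checking: a result moved between workers is serialized (so the receiver gets a copy), while a task that runs on the worker already holding its input receives the stored object itself. `ToyEstimator`, `partial_fit_task`, `transfer`, and the `worker_a` dict are hypothetical, and `copy.deepcopy` stands in for the serialization round trip.

```python
import copy


class ToyEstimator:
    """Hypothetical stand-in for an estimator with partial_fit (running sum)."""

    def __init__(self):
        self.n_seen = 0
        self.total = 0.0

    def partial_fit(self, block):
        # Incremental update: mutates self and returns self.
        self.n_seen += len(block)
        self.total += sum(block)
        return self


def partial_fit_task(model, block):
    # Mirrors the current behaviour under discussion: no copy before fitting.
    return model.partial_fit(block)


def transfer(obj):
    # Assumption: moving a result to another worker yields an independent copy.
    return copy.deepcopy(obj)


# Worker A holds the result of the first fit (block 0) in memory.
worker_a = {"fit-0": partial_fit_task(ToyEstimator(), [1.0, 2.0])}

# Case 1: the second block is fitted on worker C, which fetches fit-0 from A.
on_c = partial_fit_task(transfer(worker_a["fit-0"]), [3.0])
stored_after_cross_worker = worker_a["fit-0"].n_seen

# Case 2: the second block is fitted on worker A itself, using the stored object.
on_a = partial_fit_task(worker_a["fit-0"], [3.0])
stored_after_same_worker = worker_a["fit-0"].n_seen
```

Under these assumptions the cross-worker case matches the reasoning above: worker A's stored result is unchanged, so a retry elsewhere starts from the right state. The same-worker case is the one to watch, because the object stored as the first fit's result now also reflects the second block.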

_partial.fit modifies estimator inplace #277