Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.
https://www.neuraxle.org/
Apache License 2.0

Feature: Properly handling thread duplication of services and of transformers #484

Closed guillaume-chevalier closed 1 year ago

guillaume-chevalier commented 3 years ago

Is your feature request related to a problem? Please describe. When multithreading, we sometimes need to clone steps, data, and context. We need a way to indicate how to copy each of these efficiently, since some data, steps, or services may require different copy techniques (e.g., an in-memory repo vs. an on-disk repo, a TensorFlow v2 step, data on the GPU).

Describe the solution you'd like Just as we have a Saver for serializing steps to disk, we could have a Cloner to clone steps (and services, and perhaps even data). The base cloner would probably attempt a copy.copy or copy.deepcopy (investigation needed).
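To make the idea concrete, here is a minimal sketch of what such a Cloner could look like. None of these names exist in Neuraxle: `BaseCloner`, `DeepCopyCloner`, `SharedReferenceCloner`, `InMemoryRepo`, `ScalerStep`, and `clone_for_worker` are all hypothetical, and the choice of `copy.deepcopy` as the default is an assumption pending the investigation mentioned above.

```python
import copy
from abc import ABC, abstractmethod


class BaseCloner(ABC):
    """Hypothetical interface: decides how an object is duplicated for a worker."""

    @abstractmethod
    def clone(self, obj):
        ...


class DeepCopyCloner(BaseCloner):
    """Assumed default: a full, independent copy of the object."""

    def clone(self, obj):
        return copy.deepcopy(obj)


class SharedReferenceCloner(BaseCloner):
    """For objects that must stay shared (e.g. an in-memory repo): no copy at all."""

    def clone(self, obj):
        return obj


class InMemoryRepo:
    # Hypothetical service: every worker thread should see the same records.
    cloner = SharedReferenceCloner()

    def __init__(self):
        self.records = []


class ScalerStep:
    # Hypothetical step: each worker gets its own independent copy.
    cloner = DeepCopyCloner()

    def __init__(self, factor):
        self.hyperparams = {"factor": factor}


def clone_for_worker(obj):
    """Use the object's own cloner if it declares one, else fall back to deepcopy."""
    cloner = getattr(obj, "cloner", DeepCopyCloner())
    return cloner.clone(obj)
```

With this shape, a TensorFlow step or a GPU-resident dataset could declare its own cloner without the threading code needing to know about it. Note that sharing a reference only makes sense between threads; across processes, a service would need a different strategy.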

Describe alternatives you've considered

FYI @vincent-antaki

guillaume-chevalier commented 3 years ago

Related to #485: Services could have savers as well, not only cloners. It would be interesting to allow saving services. But do we actually have a use case for saving services too?

vincent-antaki commented 3 years ago

Extra information: So far, we've let the multiprocess library perform its default behaviour (which is a deep copy, if I remember correctly). As far as step copying goes, we copy steps before setup, which saves us the trouble of copying the heavy stuff.
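The copy-before-setup approach can be illustrated with a small sketch. `HeavyStep` and `make_worker_steps` are hypothetical stand-ins, not Neuraxle classes, and the `bytearray` merely plays the role of a heavy resource such as a loaded model.

```python
import copy


class HeavyStep:
    """Hypothetical step that holds only its configuration until setup() is called."""

    def __init__(self, size):
        self.size = size
        self.buffer = None  # the heavy resource is allocated only in setup()

    def setup(self):
        self.buffer = bytearray(self.size)
        return self


def make_worker_steps(step, n_workers):
    # Copy first, while the step is still lightweight, then set up each copy,
    # so the heavy buffer is never duplicated by deepcopy.
    return [copy.deepcopy(step).setup() for _ in range(n_workers)]
```

Each worker ends up with its own buffer, while the original step is left untouched and cheap to copy again.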

We've assumed that any service used by multiple processes will implement its own concurrency management mechanism. That being said, this method has its limits: context instances are currently thread-agnostic, so we can't have process-specific behaviour or services except in the attempt_trial function.

This way of working is sufficient for our current projects but I can think of many applications for which it wouldn't cut it (such as any model or service that would require shared memory).

Thus I agree with the general idea of formalizing the duplication of steps, context, services and data.

vincent-antaki commented 3 years ago

> Related to #485: Services could have savers as well, not only cloners. It would be interesting to allow saving services. But do we actually have a use case for saving services too?

I can think of a couple of examples where this option would have let us write less code for the same behaviour, e.g. the client project where we have a Context instance in the client code that is in charge of creating the proper infrastructure for Neuraxle's ExecutionContext.

I don't see it as urgent but as a nice-to-have feature.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in the next 180 days. Thank you for your contributions.