Feature: Properly handling thread duplication of services and of transformers

guillaume-chevalier commented 3 years ago

Is your feature request related to a problem? Please describe.

When multithreading, we sometimes need to clone steps, data, and context. We need a way to indicate how to copy each of these efficiently, as some data, steps, or services may require different copy techniques (e.g. an in-memory repo vs. an on-disk repo, a TensorFlow v2 step, data on the GPU, etc.).

Describe the solution you'd like

Just as we have a Saver for serializing steps to disk, we could have a Cloner to clone steps (and services, and perhaps even data). The base cloner would probably attempt a copy.copy or copy.deepcopy (investigation needed).
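To make the idea concrete, here is a minimal sketch of what such a Cloner could look like. Every name in it (`BaseCloner`, `SharedReferenceCloner`, `clone_with`, `OnDiskRepo`) is hypothetical and is not part of Neuraxle; the only real calls are from Python's standard `copy` module. It assumes the base strategy is `copy.deepcopy`, which is still to be investigated.

```python
import copy
from typing import Any, Sequence, Tuple, Type


class BaseCloner:
    """Hypothetical counterpart to a Saver: knows how to duplicate an object."""

    def can_clone(self, obj: Any) -> bool:
        # The base cloner accepts anything and acts as the fallback.
        return True

    def clone(self, obj: Any) -> Any:
        # Assumed default strategy: a full deep copy.
        return copy.deepcopy(obj)


class SharedReferenceCloner(BaseCloner):
    """Hypothetical cloner for objects that should be shared rather than copied,
    such as a handle to an on-disk repo."""

    def __init__(self, shared_types: Tuple[Type, ...]):
        self.shared_types = shared_types

    def can_clone(self, obj: Any) -> bool:
        return isinstance(obj, self.shared_types)

    def clone(self, obj: Any) -> Any:
        # "Cloning" here means handing out the same reference.
        return obj


def clone_with(cloners: Sequence[BaseCloner], obj: Any) -> Any:
    """Use the first cloner that accepts the object."""
    for cloner in cloners:
        if cloner.can_clone(obj):
            return cloner.clone(obj)
    raise TypeError(f"No cloner registered for {type(obj).__name__}")


class OnDiskRepo:
    """Stand-in for a service backed by files on disk (illustration only)."""

    def __init__(self, path: str):
        self.path = path
```

With `cloners = [SharedReferenceCloner((OnDiskRepo,)), BaseCloner()]`, calling `clone_with(cloners, repo)` would return the same repo handle, while any other object would be deep-copied. The ordering of the list decides which strategy wins, which is one possible design among others.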

Describe alternatives you've considered

Doing a simple deepcopy for steps that are serializable.
For non-serializable steps such as a TensorFlow v2 step, using the savers to save the step and reload it in the other thread. This is time-consuming as it relies heavily on disk. We also considered (and tried) mounted RAM disks to make this fast, but the dependencies were too heavy, so we didn't include this in Neuraxle.
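The two alternatives above can be combined into a single fallback chain, sketched below. This is only an illustration: `LockHoldingStep`, its `save`/`load` methods, and `clone_step` are hypothetical and do not reflect Neuraxle's saver API. A `threading.Lock` stands in for any member that `copy.deepcopy` cannot handle.

```python
import copy
import json
import tempfile
import threading
from pathlib import Path


class LockHoldingStep:
    """Hypothetical step that cannot be deep-copied because it holds a lock."""

    def __init__(self, weights):
        self.weights = weights
        self._lock = threading.Lock()  # locks cannot be pickled or deep-copied

    def save(self, folder: Path) -> None:
        # Persist only the state that matters; the lock is rebuilt on load.
        (folder / "weights.json").write_text(json.dumps(self.weights))

    @classmethod
    def load(cls, folder: Path) -> "LockHoldingStep":
        return cls(json.loads((folder / "weights.json").read_text()))


def clone_step(step):
    """Try a deep copy first, then fall back to a save/reload round trip."""
    try:
        return copy.deepcopy(step)
    except TypeError:
        # deepcopy raises TypeError on members such as locks.
        # The fallback goes through disk, which is the slow path.
        with tempfile.TemporaryDirectory() as tmp:
            folder = Path(tmp)
            step.save(folder)
            return type(step).load(folder)
```

The fallback branch is where the disk cost mentioned above shows up, so in this sketch it is only taken when the deep copy fails.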

FYI @vincent-antaki

guillaume-chevalier commented 3 years ago

Related to #485: services could have savers as well, not only cloners. It would be interesting to allow saving services. But do we have a use case for saving services too?

vincent-antaki commented 3 years ago

Extra information: so far, we've let the multiprocess library perform its default behaviour (which is a deep copy, if I remember correctly). And as far as step copying goes, we copy steps before setup, which saves us the trouble of copying the heavy stuff.
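As a rough illustration of why copying before setup is cheap, consider the sketch below. `HeavyStep` and its `setup` method are hypothetical stand-ins, not Neuraxle classes, and the list of floats merely plays the role of a large model or GPU handle.

```python
import copy


class HeavyStep:
    """Hypothetical step whose expensive resource is only created at setup."""

    def __init__(self, n_units: int):
        self.n_units = n_units
        self.model = None  # nothing heavy exists before setup

    def setup(self) -> "HeavyStep":
        # Stand-in for allocating a large model or a GPU resource.
        self.model = [0.0] * self.n_units
        return self


template = HeavyStep(n_units=1000)

# Copy first: each copy only carries the lightweight configuration.
worker_steps = [copy.deepcopy(template) for _ in range(3)]

# Set up afterwards: each worker builds its own heavy resource.
for worker_step in worker_steps:
    worker_step.setup()
```

The template never allocates its model, and each worker ends up with an independent one, so nothing heavy ever has to be copied.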

We've assumed that any service used by multiple processes will implement its own concurrency management mechanism. That being said, this method has its limits: context instances are currently thread-agnostic, so we can't have process-specific behaviour or services except in the attempt_trial function.

This way of working is sufficient for our current projects but I can think of many applications for which it wouldn't cut it (such as any model or service that would require shared memory).

Thus I agree with the general idea of formalizing the duplication of steps, context, services and data.

vincent-antaki commented 3 years ago

> Related to #485: Services could have savers as well, not only cloners. It would be interesting to allow to save services. But do we have this use-case of saving services too ?

I can think of a couple of examples where this option would have let us write less code for the same behaviour, e.g. the client project where we have a Context instance in the client code that is in charge of creating the proper infrastructure for Neuraxle's ExecutionContext.

I don't see it as urgent but as a nice-to-have feature.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in the next 180 days. Thank you for your contributions.

Neuraxio / Neuraxle

Feature: Properly handling thread duplication of services and of transformers #484