allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Training gets stuck after some epochs when using Tensorflow with multiprocessing #1230

Open · n-Guard opened this issue 3 months ago

n-Guard commented 3 months ago

Describe the bug

I'm using Keras/TensorFlow and the training stalls indefinitely after some epochs when I enable multiprocessing. It happens only when I use LSTM or TimeDistributed layers; Dense and Conv layers alone don't seem to have this problem. Without ClearML everything works fine.

To reproduce

Start a training run with TensorFlow and multiprocessing enabled. Choose a model with LSTM and/or TimeDistributed layers.

I provided a script; the bug usually appears within the first 100 epochs: https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
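
Roughly, the script does the following (a simplified sketch, not the exact gist; the project/task names and data shapes are placeholders, and it assumes a TF 2.x tf.keras setup where model.fit still accepts workers/use_multiprocessing):

import numpy as np
import tensorflow as tf
from clearml import Task

# Placeholder project/task names
task = Task.init(project_name="debug", task_name="lstm-multiprocessing-hang")


class RandomSequence(tf.keras.utils.Sequence):
    """Yields random (batch, timesteps, features) batches for model.fit."""

    def __init__(self, batches=32, batch_size=16, timesteps=20, features=8):
        self.batches = batches
        self.batch_size = batch_size
        self.timesteps = timesteps
        self.features = features

    def __len__(self):
        return self.batches

    def __getitem__(self, idx):
        x = np.random.rand(self.batch_size, self.timesteps, self.features).astype("float32")
        y = np.random.rand(self.batch_size, 1).astype("float32")
        return x, y


model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Multiprocessing-based data loading; the hang usually shows up within ~100 epochs
model.fit(
    RandomSequence(),
    epochs=200,
    workers=4,
    use_multiprocessing=True,
)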

Expected behaviour

The training should continue without getting stuck.

Environment

eugen-ajechiloae-clearml commented 2 months ago

Hi @n-Guard! We managed to reproduce this, but it is not clear why it happens. In the meantime, you could try running the following snippet at the very beginning of your script:

try:
    import multiprocessing
    multiprocessing.set_start_method("spawn")
except Exception:
    pass

What it does: it makes Python use spawn instead of fork when creating new processes, so the state of locks, queues, etc. will not be copied into the child processes. Not 100% sure it will help.
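
For example (just a sketch; the project/task names are placeholders), the top of your training script could look like this, with the snippet placed before any other imports and the training code under a __main__ guard, which spawn needs so the module is not re-executed in the worker processes:

try:
    import multiprocessing
    # "spawn" starts workers with a fresh interpreter instead of a fork of the
    # parent, so inherited lock/queue state cannot deadlock the children
    multiprocessing.set_start_method("spawn")
except Exception:
    # set_start_method() raises RuntimeError if a start method was already set
    pass

import tensorflow as tf
from clearml import Task

if __name__ == "__main__":
    # Quick check that spawn is actually in effect
    print("multiprocessing start method:", multiprocessing.get_start_method())
    task = Task.init(project_name="debug", task_name="lstm-multiprocessing-hang")
    # ... build the model and call model.fit(..., use_multiprocessing=True) as before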