allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Training gets stuck after some epochs when using Tensorflow with multiprocessing #1230

Open · n-Guard opened this issue 3 months ago

n-Guard commented 3 months ago

Describe the bug

I'm using Keras/TensorFlow and the training stalls indefinitely after some epochs when I enable multiprocessing. It happens only when I use LSTM or TimeDistributed layers; Dense and Conv layers alone don't seem to have this problem. Without ClearML everything works fine.

To reproduce

Start a training run with TensorFlow and multiprocessing enabled. Choose a model with LSTM and/or TimeDistributed layers.

I provided a script; the bug usually appears within the first 100 epochs: https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
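
Roughly, the script does the following (a simplified sketch, not the exact gist; the project/task names and data shapes are placeholders, and it assumes a TF 2.x tf.keras setup where model.fit still accepts workers/use_multiprocessing):

import numpy as np
import tensorflow as tf
from clearml import Task

# Placeholder project/task names
task = Task.init(project_name="debug", task_name="lstm-multiprocessing-hang")


class RandomSequence(tf.keras.utils.Sequence):
    """Yields random (batch, timesteps, features) batches for model.fit."""

    def __init__(self, batches=32, batch_size=16, timesteps=20, features=8):
        self.batches = batches
        self.batch_size = batch_size
        self.timesteps = timesteps
        self.features = features

    def __len__(self):
        return self.batches

    def __getitem__(self, idx):
        x = np.random.rand(self.batch_size, self.timesteps, self.features).astype("float32")
        y = np.random.rand(self.batch_size, 1).astype("float32")
        return x, y


model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Multiprocessing-based data loading; the hang usually shows up within ~100 epochs
model.fit(
    RandomSequence(),
    epochs=200,
    workers=4,
    use_multiprocessing=True,
)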

Expected behaviour

The training should continue without getting stuck.

Environment

eugen-ajechiloae-clearml commented 2 months ago

Hi @n-Guard! We managed to reproduce this, but it is not clear why it happens. In the meantime, you could try running the following snippet at the very beginning of your script:

try:
    import multiprocessing
    multiprocessing.set_start_method("spawn")
except Exception:
    pass

What it does: it makes Python use spawn instead of fork when creating new processes, so the state of locks, queues, etc. will not be copied into the child processes. Not 100% sure it will help.
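
For example (just a sketch; the project/task names are placeholders), the top of your training script could look like this, with the snippet placed before any other imports and the training code under a __main__ guard, which spawn needs so the module is not re-executed in the worker processes:

try:
    import multiprocessing
    # "spawn" starts workers with a fresh interpreter instead of a fork of the
    # parent, so inherited lock/queue state cannot deadlock the children
    multiprocessing.set_start_method("spawn")
except Exception:
    # set_start_method() raises RuntimeError if a start method was already set
    pass

import tensorflow as tf
from clearml import Task

if __name__ == "__main__":
    # Quick check that spawn is actually in effect
    print("multiprocessing start method:", multiprocessing.get_start_method())
    task = Task.init(project_name="debug", task_name="lstm-multiprocessing-hang")
    # ... build the model and call model.fit(..., use_multiprocessing=True) as before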