Open n-Guard opened 3 months ago
Hi @n-Guard! We managed to reproduce this, but it is not yet clear why it happens. In the meantime, you could try calling the following snippet at the very beginning of your script:
try:
    import multiprocessing
    multiprocessing.set_start_method("spawn")
except Exception:
    pass
What it does: it makes Python use spawn instead of fork when creating a new process, so the state of locks, queues, etc. is not copied into the child processes. Not 100% sure it will help.
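If you want to confirm that the switch actually took effect, something like the sketch below might help. The helper name `use_spawn_start_method` is made up for illustration and is not part of ClearML or TensorFlow; it only uses the standard `multiprocessing` module. It assumes nothing else has fixed the start method earlier in the process; if something has, `set_start_method` raises `RuntimeError` and the helper just reports whichever method is in effect.

```python
import multiprocessing


def use_spawn_start_method() -> str:
    """Hypothetical helper: try to switch to 'spawn' and report the result.

    Returns the name of the start method that is actually in effect.
    """
    try:
        multiprocessing.set_start_method("spawn")
    except RuntimeError:
        # Raised when a start method was already fixed for this process;
        # in that case we keep whatever is in place.
        pass
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # Call this before creating any pools, queues, or worker processes.
    print("start method:", use_spawn_start_method())
```

Keep in mind that with spawn, child processes re-import the main module, so the training code itself should sit under an `if __name__ == "__main__":` guard.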
Describe the bug
I'm using Keras/TensorFlow, and the training stalls indefinitely after some epochs when I enable multiprocessing. It happened only when I used LSTM or TimeDistributed layers. Dense and Conv layers alone don't seem to have this problem. Without ClearML everything works fine.
To reproduce
Start a training run with TensorFlow and multiprocessing enabled. Choose a model with LSTM and/or TimeDistributed layers.
I provided a script; the bug mostly occurs within the first 100 epochs: https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
Expected behaviour
The training should continue without getting stuck.
Environment
clearml-agent==1.7.0
clearml==1.14.4
1.14.1
2.15.0
3.11