Open NastaranVB opened 1 week ago
Hi @TaWald,
I wanted to follow up on an inquiry I made last week. Since you receive many emails daily, I thought it might have been missed. Could you please revisit my question and assist? Any help would be appreciated. Warm regards.
Same for me. I have an A100 cluster with 28 cores and 119 GB of RAM. Training works with the number of processes set to 4, but it dies if I set more. RAM usage with 4 processes is about 17-20 GB. I also use GeeseFS to mount the drive with the data.
Hi! I'm getting "RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message." when trying to run training. To train the model on the cluster, I run the following commands in the cluster's Linux terminal:
The error I get when running the command is shown below:
Any assistance in resolving this error would be greatly appreciated.