Closed jereb321 closed 2 years ago
I am getting the following error when trying to train the model:
```
Epoch 0 | Training | Elapsed Time: 0:00:14 | Steps: 89 | Loss: 25.274169
Epoch 0 | Training | Elapsed Time: 0:00:15 | Steps: 91 | Loss: 24.980746
Epoch 0 | Training | Elapsed Time: 0:00:15 | Steps: 93 | Loss: 24.580695
slurmstepd: error: Job 7518 exceeded memory limit (626359956 > 512000000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: JOB 7518 ON xantipa CANCELLED AT 2022-06-10T17:11:35
Epoch 0 | Training | Elapsed Time: 0:00:15 | Steps: 95 | Loss: 24.186966
```
I am using SLURM and a Singularity container, with the following docker image: ghcr.io/coqui-ai/stt-train:v1.3.0.
I run the following command to train inside my singularity container:
```
singularity exec --nv python -m coqui_stt_training.train \
  --train_files /STT/LARGE/train.csv \
  --dev_files /STT/LARGE/dev.csv \
  --test_files /STT/LARGE/test.csv \
  --drop_source_layers 1 \
  --alphabet_config_path /STT/alphabet.txt \
  --save_checkpoint_dir /STT/MODEL \
```
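Note that as pasted, the command has no container image argument: `singularity exec` expects the image path between the options and the command to run. A sketch of the likely intended invocation, where `stt-train_v1.3.0.sif` is a hypothetical filename for an image pulled from `ghcr.io/coqui-ai/stt-train:v1.3.0`:

```shell
# Hypothetical image filename -- build it first with:
#   singularity pull stt-train_v1.3.0.sif docker://ghcr.io/coqui-ai/stt-train:v1.3.0
singularity exec --nv stt-train_v1.3.0.sif \
  python -m coqui_stt_training.train \
    --train_files /STT/LARGE/train.csv \
    --dev_files /STT/LARGE/dev.csv \
    --test_files /STT/LARGE/test.csv \
    --drop_source_layers 1 \
    --alphabet_config_path /STT/alphabet.txt \
    --save_checkpoint_dir /STT/MODEL
```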
I already set the maximum memory that SLURM allows. I tried different docker images (1.4, 1.3, 1.0, 0.1). Only v0.1.0 works.
How much memory are you allocating for the job? This is not a bug; your system simply doesn't have enough resources to train the model.
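The `slurmstepd` log shows the job was killed for exceeding its configured limit (626359956 > 512000000), so the first thing to check is the memory request in the batch script. A minimal sbatch sketch, assuming the partition permits larger allocations; the job name, memory value, and image filename here are illustrative, not from the original post:

```shell
#!/bin/bash
#SBATCH --job-name=stt-train
#SBATCH --mem=16G          # illustrative; STT training needs far more than ~512 MB
#SBATCH --gres=gpu:1       # request a GPU for --nv to have something to bind
#SBATCH --time=24:00:00

# stt-train.sif is a hypothetical image filename
singularity exec --nv stt-train.sif python -m coqui_stt_training.train \
  --train_files /STT/LARGE/train.csv \
  --dev_files /STT/LARGE/dev.csv \
  --test_files /STT/LARGE/test.csv
```

If the cluster's partition or QOS caps per-job memory below what training needs, the cap itself has to be raised by an administrator; no container version will fit a full training run into 512 MB.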