coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Bug: Job exceeded memory limit #2238

Closed: jereb321 closed this issue 2 years ago

jereb321 commented 2 years ago

I am getting the following error when trying to train the model:

Epoch 0 | Training | Elapsed Time: 0:00:14 | Steps: 89 | Loss: 25.274169
Epoch 0 | Training | Elapsed Time: 0:00:15 | Steps: 91 | Loss: 24.980746
Epoch 0 | Training | Elapsed Time: 0:00:15 | Steps: 93 | Loss: 24.580695
slurmstepd: error: Job 7518 exceeded memory limit (626359956 > 512000000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: JOB 7518 ON xantipa CANCELLED AT 2022-06-10T17:11:35
Epoch 0 | Training | Elapsed Time: 0:00:15 | Steps: 95 | Loss: 24.186966
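For context, the two numbers in the slurmstepd error appear to be in bytes (Slurm can also report in KB depending on site configuration, so this conversion is an assumption). A quick sanity check of what the limit and usage amount to:

```python
# Values taken from the slurmstepd error above; assumed to be in bytes.
limit = 512_000_000   # job memory limit
used = 626_359_956    # memory the job actually reached

def mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / 2**20

print(f"limit ≈ {mib(limit):.0f} MiB, used ≈ {mib(used):.0f} MiB")
# → limit ≈ 488 MiB, used ≈ 597 MiB
```

Under that assumption, the job was capped at roughly 488 MiB, which is far below what model training typically needs.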

I am using Slurm and a Singularity container, built from the following Docker image: ghcr.io/coqui-ai/stt-train:v1.3.0.

I run the following command to train inside my singularity container:

singularity exec --nv python -m coqui_stt_training.train \
    --train_files /STT/LARGE/train.csv \
    --dev_files /STT/LARGE/dev.csv \
    --test_files /STT/LARGE/test.csv \
    --drop_source_layers 1 \
    --alphabet_config_path /STT/alphabet.txt \
    --save_checkpoint_dir /STT/MODEL \

I have already set the maximum memory that Slurm allows. I tried different Docker images (1.4, 1.3, 1.0, 0.1); only v0.1.0 works.

reuben commented 2 years ago

How much memory are you allocating for the job? This is not a bug; your system simply doesn't have enough resources for training a model.
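For readers hitting the same wall: the limit comes from the Slurm job request, not from Coqui STT itself. A minimal sketch of an sbatch wrapper that raises the limit via `--mem` (the memory figure, GPU request, and image filename are placeholders, not values from this issue):

```shell
#!/bin/bash
#SBATCH --job-name=stt-train
#SBATCH --mem=32G        # per-node memory limit; placeholder, size to your dataset/model
#SBATCH --gres=gpu:1     # request a GPU so --nv has a device to bind

# Image path and training flags are illustrative; adapt to your site's setup.
singularity exec --nv stt-train_v1.3.0.sif \
    python -m coqui_stt_training.train \
    --train_files /STT/LARGE/train.csv \
    --dev_files /STT/LARGE/dev.csv \
    --test_files /STT/LARGE/test.csv
```

If the memory request cannot be raised, reducing the training batch size is the other common lever for fitting within a fixed limit.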