coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Bug: Batches are not evaluated in parallel #2258

Open NanoNabla opened 2 years ago

NanoNabla commented 2 years ago

Describe the bug
I tried running STT on a system with 8 NVIDIA A100 GPUs. I found that running on up to 4 devices does not scale well: each batch seems to be serialized across the GPUs. In the first steps everything looks OK, but parallel execution gets worse over time. I attached traces generated by NVIDIA Nsight Systems.

[Attached screenshots: NVIDIA Nsight Systems traces, 2022-07-08 14-07-43 and 2022-07-08 14-08-57]

I checked my environment using a simple MNIST example, where this problem does not occur.

Environment (please complete the following information):
I used NVIDIA's Docker container nvcr.io/nvidia/tensorflow:22.02-tf1-py3, as it is used in Dockerfile.train:

python -m coqui_stt_training.train \
    --train_files $CLIPSDIR/train.csv \
    --train_batch_size 128 \
    --n_hidden 2048 \
    --learning_rate 0.0001 \
    --dropout_rate 0.40 \
    --epochs 2 \
    --cache_for_epochs 10 \
    --read_buffer 1G \
    --shuffle_batches true \
    --lm_alpha 0.931289039105002 \
    --lm_beta 1.1834137581510284 \
    --log_level 1 \
    --checkpoint_dir ${OUTDIR}/checkpoints \
    --summary_dir ${OUTDIR}/tensorboard \
    --train_cudnn true \
    --alphabet_config_path=$CLIPSDIR/alphabet.txt \
    --skip_batch_test true \
    --automatic_mixed_precision true \
    --show_progressbar false
FivomFive commented 2 years ago

Any updates on this issue? I ran into the same situation running STT on a computer with 4 x A100 GPUs. Using 2 GPUs instead of one improves training time, but by much less than half. Using more than 2 GPUs brings no further improvement; training actually takes longer with 4 GPUs than with 2. I think I am running into the same problem as @NanoNabla? How could I check this?

SuperKogito commented 2 years ago

Can you maybe share your setup and experience training with the Coqui tools? I am trying to improve my setup, but I have no clue which / how many GPUs I should go for :(

NanoNabla commented 2 years ago

Can you give more details on which information you need, @SuperKogito? I tried running on a system with 8 x NVIDIA A100-SXM4 (40 GB) and 2 x AMD EPYC 7352 CPUs (24 cores each) @ 2.3 GHz. As I already said, it did not scale with more than 4 GPUs; the speedup was close to 2, if I remember correctly. Kernels are simply not executed in parallel.

One of the problems in my setup was caused by Python's multiprocessing. I could not figure out the specific cause, but using MPI instead of a multiprocessing pool helped. Moreover, I noticed that running without --normalize_sample_rate false --audio_sample_rate 16000, i.e. doing resampling on the CPU during training, causes problems keeping 8 GPUs fed with data.

I also switched to Horovod for multi-GPU usage, but it is not supported in official STT.
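For reference, a Horovod-patched training script is typically launched with one process per GPU, roughly like this (train_horovod.py is only a placeholder name for such a patched entry point, it is not an official STT module):

# one process per GPU; Horovod averages gradients across processes via NCCL,
# so the per-process batch size is multiplied by the number of processes
horovodrun -np 8 -H localhost:8 \
    python train_horovod.py \
        --train_files $CLIPSDIR/train.csv \
        --train_batch_size 16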

To sum it up, from my perspective you will not benefit from running on more than 2 A100 devices, since this seems to be of no interest to the devs at the moment.

SuperKogito commented 2 years ago

Hello @NanoNabla, and thank you for your response. I am currently training a Dutch model on one GPU (RTX 2070 with 8 GB of VRAM + 12 cores) and it takes too long since I am training with a lot of data: almost 1 day per epoch (batch size is 1). Now I am trying to speed things up, so I am trying to find the optimal setup to train with. Also, can someone please explain why we need so much GPU RAM? With this setup over 4 GB are occupied, and when I try a batch size of 2 it overflows O.o Again, thank you for your response, this helps a bit. I really hope the devs and other users share their experience with us too.

NanoNabla commented 2 years ago

I think with 4 GB of GPU memory you should be able to use batch sizes bigger than 1. Bigger batch sizes will result in shorter training times.

Nevertheless, in my opinion this is not related to this issue. Maybe open a separate issue.

FivomFive commented 2 years ago

@NanoNabla could you share your version with Horovod and MPI if it works better? Unfortunately, this issue seems to be of no interest to the developers.

wasertech commented 1 year ago

Before we can solve this issue, we need to understand where it comes from.

The first thing we need is to make sure that the issue lies within the STT code base and not the Nvidia TF image.

To do that, someone with more than 2 GPUs (ideally 4 or more) should try to simulate a fake load on the Nvidia TensorFlow base image (nvcr.io/nvidia/tensorflow:22.02-tf1-py3) to see if the process can scale across all GPUs (give us some logs).

If it doesn't, the issue is within the Nvidia TF image, not STT; otherwise it's an STT-specific issue and we should try to hunt it down.
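For instance, something along these lines (a rough sketch; the matrix size and step count are arbitrary, the point is just to put a sustained kernel load on every visible GPU inside the base image):

docker run --rm -i --gpus all --ipc=host nvcr.io/nvidia/tensorflow:22.02-tf1-py3 python - <<'EOF'
# toy multi-GPU load: one large matmul per visible GPU, all driven from a
# single session, so the kernels should overlap if parallel execution works
import tensorflow as tf
from tensorflow.python.client import device_lib

gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
ops = []
for dev in gpus:
    with tf.device(dev):
        a = tf.random.normal([8192, 8192])
        ops.append(tf.reduce_sum(tf.matmul(a, a)))

with tf.Session() as sess:
    for step in range(200):
        sess.run(ops)
        print("step", step, flush=True)
EOF

Then watch nvidia-smi (or take an Nsight trace) and check whether all GPUs stay busy at the same time.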

... since this seems to be of no interest to the devs

We take an interest in all issues. That doesn't mean we have to react to every single one of them, especially when we don't know what is causing the issue in the first place.

Please run the requested test and let us know the results; we shall see how to deal with it.

EDIT:

@NanoNabla, you said:

I checked my environment using a simple MNIST example, where this problem does not occur.

This seems to indicate that it's an issue with STT. Could you share your MNIST load so that others can confirm your results? (@FivomFive maybe?) (give us some logs)

If it turns out it's really only an issue with STT, we need to find exactly what is causing this behavior, but I have no idea how we could even begin to diagnose it.

arbianqx commented 1 year ago

Hi guys,

I'm experiencing the same issue: I'm unable to use more than 3 GPUs at the same time. My system has 8 GPUs (NVIDIA A100).

I tried different Coqui Docker images and the outcome is the same. I also tried different GPUs (NVIDIA GeForce 1660) and the outcome was the same: only 3 GPUs were being utilised.

The next thing I tried was to update the Nvidia TensorFlow base image to the latest version in Dockerfile.train. The custom build had the same outcome!

@wasertech I can provide logs or anything else that would be helpful to you guys.

Kind regards,

NanoNabla commented 1 year ago

Hi guys,

Nice to hear that this issue is becoming active now and that other people are also running into the same problem.

We take an interest in all issues. That doesn't mean we have to react to every single one of them, especially when we don't know what is causing the issue in the first place.

To be honest, I also did not expect any further activity here, since there was no response for nearly 5 months...

My project using STT will end soon, so I will only have sparse time to spend on further STT support.

This seems to indicate that it's an issue with STT. Could you share your MNIST load so that others can confirm your results? (@FivomFive maybe?) (give us some logs)

It was just some random MNIST code for TensorFlow 1 that I found on GitHub, if I remember correctly, to check whether kernels are executed in parallel.

For better benchmarking results, maybe we should use the old TensorFlow benchmarks, since they still work with TensorFlow 1.
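Assuming this refers to the tf_cnn_benchmarks scripts from the tensorflow/benchmarks repository, a run inside the same NVIDIA container could look roughly like this (model, batch size and branch may need adjusting):

git clone https://github.com/tensorflow/benchmarks
cd benchmarks
git checkout cnn_tf_v1.15_compatible   # TF1-compatible branch
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_gpus=8 \
    --model=resnet50 \
    --batch_size=64 \
    --variable_update=replicated

If the reported images/sec scale close to linearly from 1 to 8 GPUs there, the container and hardware are fine and the bottleneck would be in STT.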

I can provide output logs and/or Nsight profiles for this benchmark, or any other small TensorFlow ones, if you wish.

Moreover, as I already mentioned, with the Horovod framework kernels are executed in parallel, but then we ran into other problems on our system. I could provide some code if you are interested.

HarikalarKutusu commented 1 year ago

To be honest, I also did not expect any further activity here, since there was no response for nearly 5 months...

No offense, but not everybody has 3+ GPU systems, let alone 8xA100's :) I know I don't have them :D

arbianqx commented 1 year ago

If we look here, we can see that they used a server with 8 NVIDIA A100 GPUs for their checkpoints: https://github.com/coqui-ai/STT/releases/tag/v1.0.0. Why are we unable to do so?

wasertech commented 1 year ago

they used a server with 8 NVIDIA A100 GPUs for their checkpoints

Yes but release 1.0.0 is based on DeepSpeech v0.9.3.

Why are we unable to do so?

@arbianqx You can try with DeepSpeech and let us know if you encounter a similar behavior. It might help us understand where it comes from.

arbianqx commented 1 year ago

I see. I used the latest DeepSpeech Docker image, which was published back in 2021. I was able to run on 6 GPUs (NVIDIA GeForce 1660) but unable to do so on A100 GPUs.

wasertech commented 1 year ago

unable to do so on A100 GPUs.

Why? What’s the issue? Logs please.

HarikalarKutusu commented 1 year ago

Why not try to debug it with Coqui STT v1.0.0 and these settings? https://gist.github.com/reuben/6ced6a8b41e3d0849dafb7cae301e905

If this works, then the problem was introduced after that point. Or it is related to flags...

arbianqx commented 1 year ago

I'm trying to debug it with these params on 1.0.0.

First of all I'm not sure how much shared memory I should specify on the docker run command:

Running with shared memory of 16, 32, 64, and 128 GB (I tried all these values), I managed to use 8 GPUs only with a batch size of 16, not 64 as shown in the GitHub gist.

wasertech commented 1 year ago

First of all I'm not sure how much shared memory I should specify on the docker run command:

Just let Nvidia tell you how much it needs. Run it without --shm-size and look at Nvidia's logs. For example, with my two RTX Titans, I get the following message:

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
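Putting it together, a training container started with those flags looks something like this (the image tag and mount paths are only examples, adapt them to your setup):

docker run --rm -it --gpus all --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/data:/data \
    -v /path/to/checkpoints:/checkpoints \
    ghcr.io/coqui-ai/stt-train:latest \
    bash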

I managed to use 8 GPUs only with a batch size of 16, not 64

As for the batch sizes, that means nothing; it's probably due to something else, like the data.

But it would mean that DS works with a high GPU count, meaning the error was introduced after 1.0.0.

arbianqx commented 1 year ago

Okay, I tried a transfer-learning run with Coqui 1.0.0 using 8 A100 GPUs, with batch_size 24. Is it safe to ignore the following error?

I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:41:12 | Steps: 1756 | Loss: 202.880101
Epoch 0 | Validation | Elapsed Time: 0:02:15 | Steps: 82 | Loss: 166.941001 | Dataset: /dev.csv
I Saved new best validating model with loss 166.941001 to: /export/best_dev-3665637

Epoch 1 | Training | Elapsed Time: 0:40:38 | Steps: 1756 | Loss: 149.227847
Epoch 1 | Validation | Elapsed Time: 0:02:24 | Steps: 82 | Loss: 139.975071 | Dataset: /dev.csv
I Saved new best validating model with loss 139.975071 to: /export/best_dev-3667393

Epoch 2 | Training | Elapsed Time: 0:40:34 | Steps: 1756 | Loss: 134.436629
Epoch 2 | Validation | Elapsed Time: 0:02:26 | Steps: 82 | Loss: 130.424641 | Dataset: /dev.csv
I Saved new best validating model with loss 130.424641 to: /export/best_dev-3669149

Epoch 3 | Training | Elapsed Time: 0:40:10 | Steps: 1756 | Loss: 127.000162
Epoch 3 | Validation | Elapsed Time: 0:02:25 | Steps: 82 | Loss: 124.543184 | Dataset: /dev.csv
I Saved new best validating model with loss 124.543184 to: /export/best_dev-3670905

Epoch 4 | Training | Elapsed Time: 0:41:25 | Steps: 1756 | Loss: 122.080435
Epoch 4 | Validation | Elapsed Time: 0:02:35 | Steps: 82 | Loss: 120.453207 | Dataset: /dev.csv
I Saved new best validating model with loss 120.453207 to: /export/best_dev-3672661

I FINISHED optimization in 3:37:01.232450
W Specifying --test_files when calling train module. Use python -m coqui_stt_training.evaluate
W Using the training module as a generic driver for all training related functionality is deprecated and will be removed soon. Use the specific modules:
W     python -m coqui_stt_training.train
W     python -m coqui_stt_training.evaluate
W     python -m coqui_stt_training.export
W     python -m coqui_stt_training.training_graph_inference
I Loading best validating checkpoint from /transfer-learning/best_dev-3663881
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/code/training/coqui_stt_training/train.py", line 705, in <module>
    main()
  File "/code/training/coqui_stt_training/train.py", line 685, in main
    evaluate.test()
  File "/code/training/coqui_stt_training/evaluate.py", line 173, in test
    samples = evaluate(Config.test_files, create_model)
  File "/code/training/coqui_stt_training/evaluate.py", line 99, in evaluate
    load_graph_for_evaluation(session)
  File "/code/training/coqui_stt_training/util/checkpoints.py", line 166, in load_graph_for_evaluation
    _load_or_init_impl(session, methods, allow_drop_layers=False)
  File "/code/training/coqui_stt_training/util/checkpoints.py", line 110, in _load_or_init_impl
    session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init
  File "/code/training/coqui_stt_training/util/checkpoints.py", line 80, in _load_checkpoint
    v.load(ckpt.get_tensor(v.op.name), session=session)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
    session.run(self.initializer, {self.initializer.inputs[1]: value})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1156, in _run
    (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (29,) for Tensor 'layer_6/bias/Initializer/zeros:0', which has shape '(30,)'

NanoNabla commented 1 year ago

No offense, but not everybody has 3+ GPU systems, let alone 8xA100's :) I know I don't have them :D

I hoped to get some support even if the hardware is not available to the devs. Moreover, it seems I'm not the only one with access to such a system, which is btw not unusual in HPC environments ;)

If we look here, we can see that they used a server with 8 NVIDIA A100 GPUs for their checkpoints: https://github.com/coqui-ai/STT/releases/tag/v1.0.0. Why are we unable to do so?

I was able to train with 8 GPUs, but it is not efficient. So the question is whether they checked that running on 8 GPUs is actually faster than running on just 2 or 3. We see many people on our system running parallel applications with a huge amount of allocated resources without checking whether their application is really using them efficiently.

I see. I used the latest DeepSpeech Docker image, which was published back in 2021. I was able to run on 6 GPUs (NVIDIA GeForce 1660) but unable to do so on A100 GPUs.

Is this Docker container based on Dockerfile.build.tmpl? That Dockerfile uses FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04. As far as I know, standard TensorFlow 1 as provided by Google does not support newer GPU architectures such as the A100. Nvidia provides extra TensorFlow 1 source code and Docker images with that support, which are already used here in STT.

lissyx commented 1 year ago

@reuben can correct me, but as far as I remember, the evaluation code has had limitations from the beginning that make it unable to run on parallel GPUs / CPUs. Fixing that is non-trivial.

wasertech commented 1 year ago

@arbianqx Please format your logs correctly. The error ValueError: Cannot feed value of shape (29,) for Tensor 'layer_6/bias/Initializer/zeros:0', which has shape '(30,)' is likely due to your pre-trained model not using the same alphabet. Read the fine-tuning / transfer-learning docs to learn more.
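If your target alphabet really differs from the one in the checkpoint, the docs describe dropping and re-initialising the output layer; a minimal sketch (flag names as in the transfer-learning docs, paths are placeholders, double-check them against your STT version):

# drop the source output layer so it is re-initialised for the new alphabet
python -m coqui_stt_training.train \
    --drop_source_layers 1 \
    --alphabet_config_path /data/alphabet.txt \
    --load_checkpoint_dir /transfer-learning \
    --save_checkpoint_dir /export/checkpoints \
    --train_files /data/train.csv \
    --dev_files /data/dev.csv \
    --train_batch_size 16 \
    --dev_batch_size 16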

Thanks @lissyx !

arbianqx commented 1 year ago

Hi,

It was a transfer-learning issue: I had forgotten to pass alphabet.txt correctly. I am now running on the Coqui Docker image 1.0.0 with a batch_size of 16 and using all 8 GPUs.

On newer Coqui Docker images, I am unable to do so.

HarikalarKutusu commented 1 year ago

@arbianqx, so that would mean the problem was introduced after v1.0.0, or that it depends on the parameters. Right?

Is it using the power of all GPUs well?

arbianqx commented 1 year ago

@arbianqx, so that would mean the problem was introduced after v1.0.0, or that it depends on the parameters. Right?

Is it using the power of all GPUs well?

As far as I can see, it is related to the version. After v1.0.0 it stopped utilising multiple GPUs.