TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost state-of-the-art Automatic Speech Recognition in TensorFlow 2. Supports languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

Low GPU utilization for multi-GPU training #50

Open · tann9949 opened this issue 3 years ago

tann9949 commented 3 years ago

I have trained a Conformer model on my own custom dataset in Thai. However, GPU utilization seems to be quite low and training is quite slow (~2 s/batch); each GPU sits at around 5-10% utilization. Is there any way to debug this problem?

For training, I simply edited examples/conformer/config.yaml and ran

$ python examples/conformer/train_conformer.py --device 0 1 2 3

Software specification:
- OS: Debian GNU/Linux 10 (buster), kernel 4.19.0-9-cloud-amd64 x86_64
- GPUs: 4x Nvidia Tesla V100 16 GB
- Installed by building from source
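
A standard way to watch utilization live while training runs (generic CLI, noted here for anyone debugging the same thing):

$ watch -n 1 nvidia-smi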

config.yaml

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/thai.characters
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.1
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths:
      - /home/chompk/trainv1_trainscript.tsv
    eval_paths:
      - /home/chompk/valv1_trainscript.tsv
    test_paths:
      - /mnt/d/SpeechProcessing/Datasets/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 4
    accumulation_steps: 4
    num_epochs: 20
    outdir: /mnt/d/SpeechProcessing/Trained/local/conformer
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000

GPU Utilization

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  15752MiB / 16130MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P0    66W / 300W |  15704MiB / 16130MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   40C    P0    66W / 300W |  15752MiB / 16130MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P0    58W / 300W |  15704MiB / 16130MiB |      7%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9598      C   python                                     15741MiB |
|    1      9598      C   python                                     15693MiB |
|    2      9598      C   python                                     15741MiB |
|    3      9598      C   python                                     15693MiB |
+-----------------------------------------------------------------------------+

Training Steps Example

> Start evaluation ...
[Eval] [Step 1000] |████████████████████| 4423/4423 [22:08<00:00,  3.33batch/s, transducer_loss=171.865]
> End evaluation ...
[Train] [Epoch 1/20] |                    | 1500/796100 [1:38:28<421:04:34,  1.91s/batch, transducer_loss=159.42458]
> Start evaluation ...
[Eval] [Step 1500] |████████████████████| 4423/4423 [23:15<00:00,  3.17batch/s, transducer_loss=153.2395]
> End evaluation ...
[Train] [Epoch 1/20] |                    | 2000/796100 [2:18:06<456:58:56,  2.07s/batch, transducer_loss=140.7582]
> Start evaluation ...
[Eval] [Step 2000] |████████████████████| 4423/4423 [22:36<00:00,  3.26batch/s, transducer_loss=137.00543]
> End evaluation ...
[Train] [Epoch 1/20] |                    | 2500/796100 [2:57:05<409:56:45,  1.86s/batch, transducer_loss=126.64603]
> Start evaluation ...
[Eval] [Step 2500] |████████████████████| 4423/4423 [22:52<00:00,  3.22batch/s, transducer_loss=126.15583]
> End evaluation ...
[Train] [Epoch 1/20] |                    | 2648/796100 [3:23:48<506:25:46,  2.30s/batch, transducer_loss=125.96002]

nglehuy commented 3 years ago

This is weird. Try using --mxp; mixed precision is faster. Anyway, I've tested on an RTX 2080 Ti and GPU usage is around 30-70%.
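
For reference, assuming the same entry point as the original post, the mixed-precision run would be:

$ python examples/conformer/train_conformer.py --device 0 1 2 3 --mxp

Presumably the flag enables TensorFlow's mixed-precision policy (float16 compute, float32 variables), which V100 Tensor Cores accelerate. On recent TF 2.x the equivalent global switch is:

import tensorflow as tf

# Sketch of the general mechanism, not the repo's exact code:
# compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")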

bill-kalog commented 3 years ago

You could also try caching the dataset with the cache flag if you have enough RAM. After the first epoch I got at least 2 batch/s, if I remember correctly, on a T4 with high GPU utilization.

tann9949 commented 3 years ago

> You could also try caching the dataset with the cache flag if you have enough RAM. After the first epoch I got at least 2 batch/s, if I remember correctly, on a T4 with high GPU utilization.

Sorry for the stupid question, but how do I use the cache flag?

bill-kalog commented 3 years ago

@tann9949 Sure, just pass --cache in your training call. You can use TFRecords too: pass --tfrecords and set the directory they should be stored in under tfrecords_dir: inside your config.yml. The records will be created for you the first time you run the training script if they don't exist yet.
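
For example, reusing the command from the original post (the TFRecords path below is a placeholder):

$ python examples/conformer/train_conformer.py --device 0 1 2 3 --cache --tfrecords

with the directory set in config.yaml:

tfrecords_dir: /path/to/tfrecords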

tann9949 commented 3 years ago

@bill-kalog Thanks for the reply!

> @tann9949 Sure, just pass --cache in your training call. You can use TFRecords too: pass --tfrecords and set the directory they should be stored in under tfrecords_dir: inside your config.yml. The records will be created for you the first time you run the training script if they don't exist yet.

I've tried this method and it speeds training up by ~1.2 s/batch. Still, GPU utilization is merely 5%:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  15752MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   39C    P0    58W / 300W |  15704MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   40C    P0    57W / 300W |  15752MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P0    57W / 300W |  15704MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18286      C   python                                     15741MiB |
|    1     18286      C   python                                     15693MiB |
|    2     18286      C   python                                     15741MiB |
|    3     18286      C   python                                     15693MiB |
+-----------------------------------------------------------------------------+

I'm not sure whether it's related to the sequence lengths of my audio files; my maximum sequence length is around 30 seconds. I have no idea how to debug this problem.

nglehuy commented 3 years ago

@tann9949 I noticed that from step 2500 the speed drops from 3.1 batch/s to 2 s/batch. Was the usage still around 5% within those first 2500 steps?

tann9949 commented 3 years ago

The 2 s/batch was during training; the 3.1 batch/s was during eval. GPU usage is around 0-5% during training and 20-40% during eval.


nglehuy commented 3 years ago

@tann9949 My bad. So the problem might lie in optimizer.apply_gradients or tape.gradient. Can you try running with train_ga_conformer.py? It does gradient-accumulation training.
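
For context, here is a minimal sketch of what gradient-accumulation training does, assuming train_ga_conformer.py follows the usual pattern (illustrative only, not the repo's actual code):

import tensorflow as tf

# Sum gradients over `accumulation_steps` mini-batches, then apply them
# once, emulating a larger effective batch size.
def train_with_accumulation(model, optimizer, loss_fn, dataset, accumulation_steps=4):
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (features, labels) in enumerate(dataset):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(features, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        if (step + 1) % accumulation_steps == 0:
            optimizer.apply_gradients(
                zip([a / accumulation_steps for a in accumulated],
                    model.trainable_variables))
            accumulated = [tf.zeros_like(v) for v in model.trainable_variables]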

tann9949 commented 3 years ago

For some reason, this gets even worse.

nglehuy commented 3 years ago

@tann9949 Yeah, as expected, because it runs with an effectively larger batch size. Can you train on LibriSpeech, so that I can see whether the cause is the GPU, the code, or the data?

tann9949 commented 3 years ago

@usimarit I've run model training on LibriSpeech with this configuration:

config.yaml

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: null
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.1
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths:
      - /home/chompk/librispeech/LibriSpeech/train-clean-100/transcript.tsv
    eval_paths:
      - /home/chompk/librispeech/LibriSpeech/dev-clean/transcript.tsv
      - /home/chompk/librispeech/LibriSpeech/dev-other/transcript.tsv
    test_paths:
      - /home/chompk/librispeech/LibriSpeech/test-clean/transcript.tsv
    tfrecords_dir: /home/chompk/tfrecords_data

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 4
    accumulation_steps: 4
    num_epochs: 20
    outdir: /home/chompk/conformer_libri
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000

Still, GPU utilization is around 0%

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    63W / 300W |  15704MiB / 16130MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P0    69W / 300W |  15752MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   39C    P0    65W / 300W |  15752MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   40C    P0    60W / 300W |  15704MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9007      C   python                                     15693MiB |
|    1      9007      C   python                                     15741MiB |
|    2      9007      C   python                                     15741MiB |
|    3      9007      C   python                                     15693MiB |
+-----------------------------------------------------------------------------+

Training ran at around 2.06 s/batch (without gradient accumulation):

[Train] [Epoch 1/20] |▏                   | 258/35660 [12:16<22:02:21,  2.24s/batch, transducer_loss=1021.1291]

tann9949 commented 3 years ago

I've also tried LibriSpeech training with train_ga_conformer.py. It still performs worse per batch, but with better GPU utilization:

[Train] [Epoch 1/20] |                    | 8/8900 [10:58<141:30:15, 57.29s/batch, transducer_loss=1505.1484]

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    94W / 300W |  15626MiB / 16130MiB |     15%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   41C    P0    69W / 300W |  15626MiB / 16130MiB |     21%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   39C    P0    66W / 300W |  15626MiB / 16130MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   40C    P0    60W / 300W |  15626MiB / 16130MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15032      C   python                                     15615MiB |
|    1     15032      C   python                                     15615MiB |
|    2     15032      C   python                                     15615MiB |
|    3     15032      C   python                                     15615MiB |
+-----------------------------------------------------------------------------+

nglehuy commented 3 years ago

@tann9949 So the problem is not the dataset; it might be TensorFlow or this V100 GPU.

nglehuy commented 3 years ago

@tann9949 How is the CPU utilization? Is it at 100%?

tann9949 commented 3 years ago

I've used 8 vCPUs and 30 GB of memory. Each CPU core's usage was around 40-50%. I'm not sure whether feature extraction is the bottleneck, but from what I've tried, using TFRecords speeds things up the most (from ~2.5 s/batch to ~1.4 s/batch). Using --mxp and --cache doesn't help that much.
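
If CPU-side feature extraction is the bottleneck, the usual tf.data levers are a parallel map, caching, and prefetch. A generic sketch with a placeholder parser and path (not the repo's actual pipeline):

import tensorflow as tf

def parse_example(serialized):
    # Placeholder parser: a real record would decode audio and transcript.
    feats = tf.io.parse_single_example(
        serialized, {"signal": tf.io.VarLenFeature(tf.float32)})
    return tf.sparse.to_dense(feats["signal"])

files = tf.io.gfile.glob("/path/to/tfrecords/*.tfrecord")  # placeholder
ds = (tf.data.TFRecordDataset(files)
      .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
      .cache()
      .padded_batch(4, padded_shapes=[None], drop_remainder=True)
      .prefetch(tf.data.AUTOTUNE))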

tann9949 commented 3 years ago

I think you can try to reproduce my error by training on LibriSpeech on a Google Cloud VM using the pytorch-1-4-cu101 image.

dathudeptrai commented 3 years ago

@tann9949 Please use https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras to trace the cause :)))
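
For reference, a minimal sketch of enabling the profiler through the standard Keras TensorBoard callback (the log directory is a placeholder):

import tensorflow as tf

# Profile batch 2; afterwards, open the "Profile" tab in TensorBoard to see
# whether time goes to the input pipeline, GPU kernels, or host-device copies.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="/tmp/tb_logs",  # placeholder
    profile_batch=2,
)
# Pass callbacks=[tb_callback] to model.fit(...)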

mjurkus commented 3 years ago

I'm also suffering from low GPU utilization, even with a single GPU. See the graph below. Details:

If I try to increase the batch size, it fails with OOM instantly, so this is the best I could get.

Config:

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.characters
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 640
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/cc_manifest_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-train
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/cc_manifest_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-eval
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True

  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/cc_manifest_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-test
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    accumulation_steps: 16
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /tf_asr/models/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

[GPU utilization graph]

nglehuy commented 3 years ago

I've been testing some models and I see that models without an RNN reach around 90-100% GPU utilization, while models with at least one RNN sit at around 25-70%. Do you guys have any idea how to improve GPU utilization for RNNs?
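
One general TF-side lever, offered as an assumption to check rather than something verified in this repo: make sure the LSTM dispatches to the fused cuDNN kernel, which tf.keras.layers.LSTM only does under specific argument settings:

import tensorflow as tf

# tf.keras.layers.LSTM uses the fast fused cuDNN kernel only when
# activation='tanh', recurrent_activation='sigmoid', recurrent_dropout=0,
# unroll=False, and use_bias=True; any other combination silently falls
# back to a much slower generic implementation.
rnn = tf.keras.layers.LSTM(
    units=320,
    activation="tanh",
    recurrent_activation="sigmoid",
    recurrent_dropout=0.0,
    unroll=False,
    use_bias=True,
)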

slesinger commented 3 years ago

I also suffered from low and only occasional GPU utilization. Switching to the Keras training loop helped.

nglehuy commented 3 years ago

Do you guys still have this issue?

flaviofafe1414 commented 3 years ago

I solved the problem by installing the CUDA and cuDNN drivers system-wide via sudo on Linux; for some reason, this problem happens when they are installed through Anaconda.