TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

Issues running conformer example with RTX 3090 #54

Closed: jiwidi closed this issue 3 years ago

jiwidi commented 3 years ago

Hi!

First of all, very nice repository — great work!

I've been trying to run your Conformer example with an RTX 3090 from the new NVIDIA 30-series, and I was wondering whether that's something you have tried/tested or even support.

I'm running CUDA 11.1 with cuDNN 11.1-v8.0.5.39, and I tried running your installation commands with conda:

conda create -y -n tfasr tensorflow-gpu python=3.7 # tensorflow if using CPU
conda activate tfasr
pip install -U tensorflow-gpu # upgrade to latest version of tensorflow 
git clone https://github.com/TensorSpeech/TensorFlowASR.git
cd TensorFlowASR
python setup.py install
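
A quick check at this point, not part of the original steps, is to confirm that the fresh TF install can actually see the GPU:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

An empty list here points to the CUDA/driver setup rather than TensorFlowASR.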

Then I installed the RNNT loss with

export CUDA_HOME=/usr/local/cuda && ./scripts/install_rnnt_loss.sh

and got this output:

Cloning into 'warp-transducer'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 914 (delta 2), reused 9 (delta 1), pack-reused 894
Receiving objects: 100% (914/914), 252.69 KiB | 1014.00 KiB/s, done.
Resolving deltas: 100% (463/463), done.
-- The C compiler identification is GNU 10.2.0
-- The CXX compiler identification is GNU 10.2.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- cuda found TRUE
-- Building shared library with GPU support
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/kingston/github/TensorFlowASR/externals/warp-transducer/build
[  7%] Building NVCC (Device) object CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o
nvcc fatal   : Unsupported gpu architecture 'compute_30'
CMake Error at warprnnt_generated_rnnt_entrypoint.cu.o.cmake:220 (message):
  Error generating
  /mnt/kingston/github/TensorFlowASR/externals/warp-transducer/build/CMakeFiles/warprnnt.dir/src/./warprnnt_generated_rnnt_entrypoint.cu.o

make[2]: *** [CMakeFiles/warprnnt.dir/build.make:65: CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:191: CMakeFiles/warprnnt.dir/all] Error 2
make: *** [Makefile:130: all] Error 2
2020-11-22 19:08:46.598045: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Could not find libwarprnnt.so in ../build.
Build warp-rnnt and set WARP_RNNT_PATH to the location of libwarprnnt.so (default is '../build')

It looked like warp-transducer wouldn't compile, so, following this post, I commented out the following lines in the warp-transducer CMake file:

# set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2")
# set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_35,code=sm_35")

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_50,code=sm_50")
# set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_52,code=sm_52")

and I was able to compile it. After this I tried running the example (python examples/conformer/train_conformer.py), but it wouldn't start due to a GPU error:

Run on 1 Physical GPUs
Traceback (most recent call last):
  File "examples/conformer/train_conformer.py", line 57, in <module>
    strategy = setup_strategy(args.devices)
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/TensorFlowASR-0.3.1-py3.7.egg/tensorflow_asr/utils/__init__.py", line 63, in setup_strategy
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 269, in __init__
    self, devices=devices, cross_device_ops=cross_device_ops)
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 306, in __init__
    devices = devices or all_local_devices()
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 172, in all_local_devices
    devices = config.list_logical_devices("GPU")
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow/python/framework/config.py", line 403, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1344, in list_logical_devices
    self.ensure_initialized()
  File "/home/jiwidi/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 539, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

So I upgraded TensorFlow with pip install tf-nightly-gpu==2.5.0.dev20201028, which solved it. Now I'm able to run the example script, but the loss is always equal to 0, and I wonder whether this is normal or a bug in my installation:

[Train] [Epoch 1/20] |                    | 25/142680 [05:19<504:44:03, 12.74s/batch, transducer_loss=0.0]

Is this related to the TF version or the warp-transducer version? Has anyone run examples from this repository on the new NVIDIA 30-series cards? Could you share some details about your installation?

Here is the full output from my execution of the conformer example:

Run on 1 Physical GPUs
Model: "conformer_encoder"
________________________________________________________________________________________________________________________
Layer (type)                                          Output Shape                                    Param #           
========================================================================================================================
conformer_encoder_subsampling (Conv2dSubsampling)     multiple                                        188208            
________________________________________________________________________________________________________________________
conformer_encoder_pe (PositionalEncodingConcat)       multiple                                        0                 
________________________________________________________________________________________________________________________
conformer_encoder_linear (Dense)                      multiple                                        414864            
________________________________________________________________________________________________________________________
conformer_encoder_dropout (Dropout)                   multiple                                        0                 
________________________________________________________________________________________________________________________
conformer_encoder_block_0 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_1 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_2 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_3 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_4 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_5 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_6 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_7 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_8 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_9 (ConformerBlock)            multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_10 (ConformerBlock)           multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_11 (ConformerBlock)           multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_12 (ConformerBlock)           multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_13 (ConformerBlock)           multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_14 (ConformerBlock)           multiple                                        506736            
________________________________________________________________________________________________________________________
conformer_encoder_block_15 (ConformerBlock)           multiple                                        506736            
========================================================================================================================
Total params: 8,710,848
Trainable params: 8,706,240
Non-trainable params: 4,608
________________________________________________________________________________________________________________________
Model: "conformer_prediction"
________________________________________________________________________________________________________________________
Layer (type)                                          Output Shape                                    Param #           
========================================================================================================================
conformer_prediction_embedding (Embedding)            multiple                                        9280              
________________________________________________________________________________________________________________________
conformer_prediction_dropout (Dropout)                multiple                                        0                 
________________________________________________________________________________________________________________________
conformer_prediction_ln_0 (LayerNormalization)        multiple                                        640               
________________________________________________________________________________________________________________________
conformer_prediction_lstm_0 (LSTM)                    multiple                                        820480            
========================================================================================================================
Total params: 830,400
Trainable params: 830,400
Non-trainable params: 0
________________________________________________________________________________________________________________________
Model: "conformer_joint"
________________________________________________________________________________________________________________________
Layer (type)                                          Output Shape                                    Param #           
========================================================================================================================
conformer_joint_enc (Dense)                           multiple                                        46400             
________________________________________________________________________________________________________________________
conformer_joint_pred (Dense)                          multiple                                        102400            
________________________________________________________________________________________________________________________
conformer_joint_vocab (Dense)                         multiple                                        9309              
========================================================================================================================
Total params: 158,109
Trainable params: 158,109
Non-trainable params: 0
________________________________________________________________________________________________________________________
Model: "conformer"
________________________________________________________________________________________________________________________
Layer (type)                                          Output Shape                                    Param #           
========================================================================================================================
conformer_encoder (ConformerEncoder)                  multiple                                        8710848           
________________________________________________________________________________________________________________________
conformer_prediction (TransducerPrediction)           multiple                                        830400            
________________________________________________________________________________________________________________________
conformer_joint (TransducerJoint)                     multiple                                        158109            
========================================================================================================================
Total params: 9,699,357
Trainable params: 9,694,749
Non-trainable params: 4,608
________________________________________________________________________________________________________________________
Reading /mnt/kingston/asr-datasets/LibriSpeech/train-clean-100/transcripts.tsv ...
2020-11-22 19:36:11.117394: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:654] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 2
        }
      }
    }
  }
}

Reading /mnt/kingston/asr-datasets/LibriSpeech/dev-clean/transcripts.tsv ...
Reading /mnt/kingston/asr-datasets/LibriSpeech/dev-other/transcripts.tsv ...
2020-11-22 19:36:11.158860: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:654] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 2
        }
      }
    }
  }
}

[Train] |                    | 0/142680 [00:00<?, ?batch/s]2020-11-22 19:36:35.725016: W tensorflow/stream_executor/gpu/asm_compiler.cc:63] Running ptxas --version returned 256
2020-11-22 19:36:35.808323: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
[Train] [Epoch 1/20] |                    | 1/142680 [00:37<1497:40:58, 37.79s/batch, transducer_loss=0.0]

Thanks

nglehuy commented 3 years ago

Loss 0.0 means that there's something wrong with the RNNT loss installation 😢

jiwidi commented 3 years ago

Loss 0.0 means that there's something wrong with the RNNT loss installation 😢

Yeah, I thought so too. Do you have any tips on how to check the RNNT installation, maybe some sample code I can work with to debug it? The installation script runs fine, though.

nglehuy commented 3 years ago

@jiwidi Unfortunately, the warp-transducer repo seems to be abandoned, and its test script is outdated too, so we would have to implement a new one to debug and test; otherwise I don't see any other options.
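
For reference, a minimal sanity check of the loss binding could look like the sketch below. It assumes the warprnnt_tensorflow package and the rnnt_loss(acts, labels, input_lengths, label_lengths, blank_label) signature from the HawkAaron warp-transducer lineage that the install script builds; other forks may differ:

import numpy as np
import tensorflow as tf
from warprnnt_tensorflow import rnnt_loss  # built by install_rnnt_loss.sh

# Random joint-network outputs: [batch, time, max_label_len + 1, vocab]
B, T, U, V = 2, 10, 5, 8
logits = tf.random.normal([B, T, U, V])
# Random non-blank labels (blank index = 0): [batch, max_label_len]
labels = tf.constant(np.random.randint(1, V, size=(B, U - 1)), dtype=tf.int32)
logit_lengths = tf.fill([B], T)      # full time length for every utterance
label_lengths = tf.fill([B], U - 1)  # full label length for every utterance

loss = rnnt_loss(logits, labels, logit_lengths, label_lengths, blank_label=0)
# A healthy build prints finite, non-zero values; an all-zero result
# reproduces the broken-install symptom seen during training.
print(loss.numpy())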

danielkope commented 3 years ago

Hard to tell without seeing the CMakeLists, but I suspect that you are using CUDA 11. The CMake file probably doesn't compile for the latest architecture. I would add the appropriate SM there first before starting to debug the source itself.

If you run CUDA 11.1, you can use these arch/SM flags for the RTX 30-series cards:

-arch=sm_80 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_86,code=compute_86

For CUDA 11.0, use:

-arch=sm_52 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_86,code=compute_86
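
In the warp-transducer CMake file, that would mean replacing the commented-out compute_30/compute_35 lines with something along these lines (a sketch; exact placement varies by fork):

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_80,code=sm_80 -O2")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_86,code=sm_86")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_86,code=compute_86")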

danielkope commented 3 years ago

Did this suggestion fix the issue?

jiwidi commented 3 years ago

Did this suggestion fix the issue?

Hi! Sorry, I have been out during the week and couldn't try the solution; I will try it this weekend and get back to you. Thanks!

jiwidi commented 3 years ago

Did this suggestion fix the issue?

Hi again! It got fixed and now I have a loss != 0, but GPU usage during training is very low (around 5%). Is that normal for this model? It takes 6 seconds per batch on the 3090. Maybe it's still not running on the GPU?

The RNNT transducer loss is installed with CUDA found, and when running the example it reports that it is running on the GPU, so it shouldn't be wrongly running on the CPU.
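
One quick way to rule out a silent CPU fallback (a standard TF check, not from this thread) is device-placement logging, enabled before any ops run:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)    # log the device of every op
print(tf.config.list_physical_devices('GPU'))  # should list the 3090

With placement logging on, any op landing on /device:CPU:0 instead of /device:GPU:0 shows up directly in the console.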

nglehuy commented 3 years ago

@jiwidi No, that's not normal. Can you use the profiler to log the training performance?

jiwidi commented 3 years ago

@jiwidi No, that's not normal. Can you use the profiler to log the training performance?

Yeah, will do later today. Any tips for using the profiler with your library?

jiwidi commented 3 years ago

@usimarit So it's been a bit hard to find exactly where to put my profiling code; I can't easily navigate your class structure. From looking at the train_conformer.py example, I thought the train step being run was this function: https://github.com/TensorSpeech/TensorFlowASR/blob/e08e208f90ccc82d47751a05ce22b7d0ec78f685/tensorflow_asr/runners/transducer_runners.py#L48.

But it is not: I replaced it with the following function to print the profiling:

    @tf.function(experimental_relax_shapes=True)
    def _train_step(self, batch):
        with profiler.profile(record_shapes=True) as prof:
            with profiler.record_function("model_inference"):
                _, features, input_length, labels, label_length, pred_inp = batch

                with tf.GradientTape() as tape:
                    logits = self.model([features, pred_inp], training=True)
                    tape.watch(logits)
                    per_train_loss = rnnt_loss(
                        logits=logits, labels=labels, label_length=label_length,
                        logit_length=(input_length // self.model.time_reduction_factor),
                        blank=self.text_featurizer.blank
                    )
                    train_loss = tf.nn.compute_average_loss(per_train_loss,
                                                            global_batch_size=self.global_batch_size)

                gradients = tape.gradient(train_loss, self.model.trainable_variables)
                self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))

                self.train_metrics["transducer_loss"].update_state(per_train_loss)
            print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

And I get no print output. Could you point me to where I should hook in the profiling? Where can I profile every batch run by the example?

Thanks!
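
Two notes on the snippet above: profiler.profile / record_function / key_averages is PyTorch's autograd profiler API, which cannot see TensorFlow ops, and a Python print inside a @tf.function only executes at trace time, which would explain the missing output. A sketch using TensorFlow's own profiler instead, where train_dataset and runner are hypothetical stand-ins for the library's dataset and runner objects:

import tensorflow as tf

tf.profiler.experimental.start("logs/profile")    # hypothetical log dir
for step, batch in enumerate(train_dataset):      # hypothetical dataset handle
    with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
        runner._train_step(batch)                 # hypothetical runner handle
    if step >= 10:                                # a few profiled batches suffice
        break
tf.profiler.experimental.stop()

The resulting trace can then be inspected with tensorboard --logdir logs/profile.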

danielkope commented 3 years ago

Did you rebuild the package after your code changes?

danielkope commented 3 years ago

I have tried this on an RTX 3090 myself in the meantime. There are some issues in warp-rnnt that need to be addressed, as discussed before, but after that it works. I compiled under CUDA 11.1 with the latest TensorFlow that supports CUDA 11.1 and SM 86. I'm seeing > 5 batches/second with a batch size of 6, and no gradient accumulation (GA).

nglehuy commented 3 years ago

I'll close this issue, since the RNNT loss implementation in pure TF drops our dependency on GPU devices and CUDA versions and leaves that to TensorFlow itself, so this problem is solved. Feel free to reopen, or open a new issue if the problem occurs again (for the RNNT loss in TF only; the warp-rnnt loss is deprecated).