Segmentation fault while running train_conformer.py

ryantang1993 commented 3 years ago

Hello, everyone! I was trying to run train_conformer.py on the LibriSpeech dataset (in particular, dev-clean), and I got the error below:

Run on 1 Physical GPUs

Model: "conformer_encoder"

Layer (type) Output Shape Param #

conformer_encoder_subsampling (Conv2dSubsampling) (None, None, 2880) 188208

conformer_encoder_pe (PositionalEncodingConcat) (1, None, 144) 0

conformer_encoder_linear (Dense) (None, None, 144) 414864

conformer_encoder_dropout (Dropout) (None, None, 144) 0

conformer_encoder_block_0 (ConformerBlock) (None, None, 144) 506736

conformer_encoder_block_1 (ConformerBlock) (None, None, 144) 506736

conformer_encoder_block_2 (ConformerBlock) (None, None, 144) 506736

conformer_encoder_block_3 (ConformerBlock) (None, None, 144) 506736

conformer_encoder_block_4 (ConformerBlock) (None, None, 144) 506736

conformer_encoder_block_5 (ConformerBlock) (None, None, 144) 506736

Total params: 3,643,488
Trainable params: 3,641,760
Non-trainable params: 1,728

Model: "conformer_prediction"

Layer (type) Output Shape Param #

conformer_prediction_embedding (Embedding) (None, None, 320) 9280

conformer_prediction_dropout (Dropout) (None, None, 320) 0

conformer_prediction_ln_0 (LayerNormalization) (None, None, 320) 640

conformer_prediction_lstm_0 (LSTM) [(None, None, 320), (None, 320), (None, 320)] 820480

Total params: 830,400
Trainable params: 830,400
Non-trainable params: 0

Model: "conformer_joint"

Layer (type) Output Shape Param #

conformer_joint_enc (Dense) (None, None, 320) 46400

conformer_joint_pred (Dense) (None, None, 320) 102400

conformer_joint_vocab (Dense) multiple 9309

Total params: 158,109
Trainable params: 158,109
Non-trainable params: 0

Model: "conformer"

Layer (type) Output Shape Param #

conformer_encoder (ConformerEncoder) (None, None, 144) 3643488

conformer_prediction (TransducerPrediction) (None, None, 320) 830400

conformer_joint (TransducerJoint) (None, None, None, 29) 158109

Total params: 4,631,997
Trainable params: 4,630,269
Non-trainable params: 1,728

Reading /media/huaxin/tcl1/asr/tanglei/work2020/work202012/TensorFlowASR_New/TensorFlowASR/examples/conformer/data/libri_train.tsv ...

Reading /media/huaxin/tcl1/asr/tanglei/work2020/work202012/TensorFlowASR_New/TensorFlowASR/examples/conformer/data/libri_dev.tsv ...

[Train] | | 0/14980 [00:00<?, ?batch/s]

./train_conformer.sh: line 1: 4671 Segmentation fault (core dumped) python ./examples/conformer/train_conformer.py --config ./examples/conformer/config.yml --tbs 2 --ebs 2 --devices 0

Some information about the system and software is as follows:

lsb_release -a:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial

gcc --version:
gcc (Ubuntu 4.9.3-13ubuntu2) 4.9.3

g++ --version:
g++ (Ubuntu 4.9.3-13ubuntu2) 4.9.3

pip list:
...
ctc-decoders 1.1
tensorboard 2.4.0
tensorboard-plugin-wit 1.7.0
tensorflow-addons 0.11.2
tensorflow-datasets 3.2.1
tensorflow-estimator 2.3.0
tensorflow-gpu 2.3.1
tensorflow-metadata 0.26.0
TensorFlowASR 0.4.3
termcolor 1.1.0
threadpoolctl 2.1.0
tqdm 4.54.1
typeguard 2.10.0
typing-extensions 3.7.4.3
urllib3 1.26.2
warprnnt-tensorflow 0.1

cuDNN: v7.6.4

This is very strange, and I can't find what is wrong at the moment. Do you have any ideas? Thanks very much.
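One way to narrow down a crash like this, as a minimal sketch: Python's standard `faulthandler` module can print the Python-level traceback when the process receives SIGSEGV, which shows which Python call was active when the native code crashed. The helper name `enable_crash_traceback` is hypothetical; `faulthandler.enable` and `faulthandler.is_enabled` are the actual standard-library calls.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Dump the Python traceback of all threads if the process segfaults.

    `stream` must be a real file with a file descriptor and must stay
    open for as long as the handler is enabled.
    """
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage: call this at the very top of the training script,
# before the model and the loss are built:
#     enable_crash_traceback()
```

The same effect is available without editing the script by starting the interpreter with `python -X faulthandler` followed by the original script path and arguments.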

ryantang1993 commented 3 years ago

Oh, here is my running config:

model_config:
  encoder_num_blocks: 6

dataset_config:
  train_paths:
    - /media/huaxin/tcl1/asr/tanglei/work2020/work202012/TensorFlowASR_New/TensorFlowASR/examples/conformer/data/libri_train.tsv
  eval_paths:
    - /media/huaxin/tcl1/asr/tanglei/work2020/work202012/TensorFlowASR_New/TensorFlowASR/examples/conformer/data/libri_dev.tsv
  test_paths:
    - /media/huaxin/tcl1/asr/tanglei/work2020/work202012/TensorFlowASR_New/TensorFlowASR/examples/conformer/data/libri_test.tsv

running_config:
  batch_size: 2
  accumulation_steps: 4
  num_epochs: 20
  outdir: /media/huaxin/tcl1/asr/tanglei/work2020/work202012/TensorFlowASR_New/TensorFlowASR/examples/conformer/model
  log_interval_steps: 300
  eval_interval_steps: 500
  save_interval_steps: 1000

And the TSV data file looks like this:

PATH DURATION TRANSCRIPT
/media/huaxin/tcl1/asr/tanglei/dataset/LibriSpeech/LibriSpeech/dev-clean/2428/83699/2428-83699-0000.wav 13.30 i imagine there were several kinds of old fashioned christmases but it could hardly be worse than a chop in my chambers or horror of horrors at the club or my cousin lucy's notion of what she calls the festive season
/media/huaxin/tcl1/asr/tanglei/dataset/LibriSpeech/LibriSpeech/dev-clean/2428/83699/2428-83699-0001.wav 2.07 festive yes
...
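To rule out a malformed manifest before looking at the native libraries, a quick sanity check of a file in this layout could look like the sketch below. It assumes the columns are tab-separated with exactly the header shown above; `check_transcript_tsv` is a hypothetical helper, not part of TensorFlowASR.

```python
import csv
import os


def check_transcript_tsv(tsv_path, check_audio_exists=True):
    """Return a list of (line_number, message) problems found in the TSV."""
    problems = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps apostrophes and quotes in transcripts untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header != ["PATH", "DURATION", "TRANSCRIPT"]:
            problems.append((1, f"unexpected header: {header!r}"))
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 3:
                problems.append((line_no, f"expected 3 columns, got {len(row)}"))
                continue
            path, duration, transcript = row
            try:
                if float(duration) <= 0:
                    problems.append((line_no, "non-positive duration"))
            except ValueError:
                problems.append((line_no, f"bad duration: {duration!r}"))
            if not transcript.strip():
                problems.append((line_no, "empty transcript"))
            if check_audio_exists and not os.path.isfile(path):
                problems.append((line_no, f"missing audio file: {path}"))
    return problems
```

An empty list means every row has three fields, a positive numeric duration, a non-empty transcript, and an audio path that exists on disk.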

nglehuy commented 3 years ago

Try setting up the environment with Anaconda3 as described in the README (Anaconda can install CUDA and cuDNN inside the environment) and see whether the error still occurs.

ryantang1993 commented 3 years ago

@usimarit I created a virtual environment with Anaconda3 and followed the instructions in the README, but got the same error. I then ran the DeepSpeech2 code successfully, so the problem most likely comes from warp-transducer.
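A suspicion like this can be tested in isolation by running a small snippet in a child process and checking whether it was killed by a signal. The following is a minimal sketch for POSIX systems; `run_isolated` is a hypothetical helper, and the module name `warprnnt_tensorflow` in the usage comment is an assumption based on the `warprnnt-tensorflow` package in the pip list above.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=600):
    """Run a Python snippet in a child process and report how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number (11 is SIGSEGV on Linux).
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


# Hypothetical usage (module name is an assumption):
#     print(run_isolated("import warprnnt_tensorflow"))
```

Because the crash in this thread happened at the first training batch rather than at import time, the snippet passed in may need to call the loss on a small dummy batch to reproduce it.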

ryantang1993 commented 3 years ago

Solved! After downloading the TensorFlow source code and compiling it with GCC 5.4 and G++ 5.4, the problem was resolved.
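Since the fix here involved a newer compiler, a quick check of the local GCC version can be sketched as follows. `parse_gcc_version` and `is_at_least` are hypothetical helpers, and the 5.4 threshold only mirrors the version reported to work in this thread; it is not a documented requirement.

```python
import re


def parse_gcc_version(version_output):
    """Extract (major, minor, patch) from the first line of `gcc --version`."""
    first_line = version_output.strip().splitlines()[0]
    # The version number follows the parenthesised distribution tag,
    # e.g. "gcc (Ubuntu 4.9.3-13ubuntu2) 4.9.3".
    match = re.search(r"\)\s+(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"no version found in: {first_line!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(version, minimum=(5, 4, 0)):
    """Compare version tuples; the default minimum is an assumption."""
    return version >= minimum


reported = parse_gcc_version("gcc (Ubuntu 4.9.3-13ubuntu2) 4.9.3")
print(reported, is_at_least(reported))
```

TensorFlow also exposes the compiler version string it was built with as `tf.version.COMPILER_VERSION`, which can be compared against the local toolchain used to build native extensions.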

TensorSpeech / TensorFlowASR

Segmentation fault while running train_conformer.py #77