@tarunsaib1997 Do you have `freeze_param` in your config file, with `frontend` listed in it?
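(For reference, a `freeze_param` entry such as `freeze_param: ["frontend.upstream"]` in the yaml would effectively do the following; a minimal sketch, and the parameter prefix is just an example:)

```python
# Sketch of what freezing the frontend via freeze_param amounts to in PyTorch.
# "frontend.upstream" is an example prefix, not necessarily your exact module path.
for name, param in model.named_parameters():
    if name.startswith("frontend.upstream"):
        param.requires_grad = False  # excluded from gradient updates
```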
No, `freeze_param` was not used at all in the config file. Below is the entire conf file:
```yaml
batch_type: folded #numel
batch_bins: 5000000 #140000000
batch_size: 8
accum_grad: 2 #1
max_epoch: 100
patience: none
init: none
best_model_criterion:

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 15

decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: modified_cpc  # Note: If the upstream is changed, please change the input_size in the preencoder.
    download_dir: ./hub
    multilayer_feature: True

preencoder: none
preencoder_conf:
    input_size: 256  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 256

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
```
Did you update to the latest espnet?
I'll pull the latest repo and let you know in a while, @simpleoier.
The latest repo with the same conf gives the following error in the training stage:

```
File "/home/taruns/.conda/envs/esp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "/home/taruns/.conda/envs/esp/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "/raid/home/taruns/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
File "/raid/home/taruns/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
File "/raid/home/taruns/espnet/espnet2/tasks/abs_task.py", line 1019, in main
    cls.main_worker(args)
File "/raid/home/taruns/espnet/espnet2/tasks/abs_task.py", line 1315, in main_worker
    cls.trainer.run(
File "/raid/home/taruns/espnet/espnet2/train/trainer.py", line 286, in run
    all_steps_are_invalid = cls.train_one_epoch(
File "/raid/home/taruns/espnet/espnet2/train/trainer.py", line 589, in train_one_epoch
    loss.backward()
File "/home/taruns/.conda/envs/esp/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/taruns/.conda/envs/esp/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cudnn RNN backward can only be called in training mode
```
@simpleoier, any suggestions on how to rectify this?
This got solved by pulling the latest stable release. I am now checking whether the frontend weights update with this stable release.
@sw005320 @simpleoier I pulled the latest stable release but the front-end weights are not changing. Can you please help me out?
I'll test it.
Can you use this config?
```yaml
batch_type: folded
batch_bins: 5000000
batch_size: 8
accum_grad: 2 #1
max_epoch: 100
patience: none
init: none
best_model_criterion:

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 15

decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: modified_cpc
    download_dir: ./hub
    multilayer_feature: False

preencoder: linear
preencoder_conf:
    input_size: 256
    output_size: 80

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
```
Hi @tarunsaib1997, I found the reason for the previous error `RuntimeError: cudnn RNN backward can only be called in training mode`. It's caused by this line, where we set the frontend to eval mode. We do that to avoid batch-norm / dropout effects, which is what we want in most cases, and the Transformer upstreams work with it. However, an LSTM may not: cuDNN RNNs do not support the backward pass in eval mode.
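A minimal standalone repro of that failure mode (assuming a CUDA build of PyTorch; the module sizes are arbitrary):

```python
import torch

rnn = torch.nn.LSTM(input_size=8, hidden_size=8).cuda()
rnn.eval()  # frontend forced into eval mode, as in the line above
x = torch.randn(4, 2, 8, device="cuda", requires_grad=True)
out, _ = rnn(x)
# Raises: RuntimeError: cudnn RNN backward can only be called in training mode
out.sum().backward()
```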
After I removed that line, the error disappeared. I also checked the gradients of the frontend parameters; they are non-zero. Hope this helps.
```
(Pdb) p model.frontend.upstream.model.gAR.baseNet.weight_ih_l0.grad
tensor([[-0.0237, -0.0463, -0.0360,  ..., -0.0031, -0.1224, -0.0897],
        [-0.0189,  0.0596, -0.0175,  ..., -0.0410,  0.0529,  0.1328],
        [-0.0255, -0.0098,  0.0256,  ...,  0.0307, -0.1083, -0.0451],
        ...,
        [ 0.0102,  0.0164,  0.0138,  ...,  0.0110, -0.0148,  0.0320],
        [ 0.0188,  0.0066,  0.0119,  ...,  0.0150,  0.0090,  0.0128],
        [-0.0122, -0.0241, -0.0020,  ...,  0.0049, -0.0012, -0.0410]],
       device='cuda:0')
```
If I comment out that line and rerun the code, the gradients should back-propagate, right? @simpleoier
Yes. In my case, the frontend parameters have non-zero gradients during back-prop, so the optimizer will update them.
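If you want to double-check on your side, a rough sanity check (a sketch: `model`, `batch`, and `optimizer` come from your own training setup) is to snapshot the frontend weights, run one step, and diff:

```python
import torch

# Snapshot frontend weights before a single training step.
before = {n: p.detach().clone() for n, p in model.frontend.named_parameters()}
loss = model(**batch)[0]  # ESPnet models return (loss, stats, weight)
loss.backward()
optimizer.step()
# Any tensor that changed was actually updated by the optimizer.
changed = [n for n, p in model.frontend.named_parameters()
           if not torch.equal(before[n], p.detach())]
print(f"{len(changed)} frontend tensors changed")
```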
Thanks, this worked out.
I was using the following conf for the frontend part:
```yaml
frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: modified_cpc  # Note: If the upstream is changed, please change the input_size in the preencoder.
    download_dir: ./hub
    multilayer_feature: True

preencoder: none
preencoder_conf:
    input_size: 256  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 256
```
I ran for two epochs and saw that there are no changes in the frontend weights.
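(I checked by diffing the saved checkpoints, roughly like the sketch below; the experiment paths are illustrative.)

```python
import torch

# Load the model state dicts saved after each epoch (paths are illustrative).
a = torch.load("exp/.../1epoch.pth", map_location="cpu")
b = torch.load("exp/.../2epoch.pth", map_location="cpu")
diffs = [(a[k] - b[k]).abs().max().item()
         for k in a if k.startswith("frontend.")]
print(max(diffs))  # 0.0 means the frontend weights never moved
```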
Will the gradients back-propagate through the modified_cpc frontend without any further changes in the pipeline? If not, can you suggest how to achieve that?