espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

How to make frontend trainable in ESPNet2? #4334

Closed tarunsaib1997 closed 2 years ago

tarunsaib1997 commented 2 years ago

I was using the following conf for the frontend part:

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: modified_cpc  # Note: If the upstream is changed, please change the input_size in the preencoder.
    download_dir: ./hub
    multilayer_feature: True

preencoder: none
preencoder_conf:
    input_size: 256  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 256

I ran for two epochs and saw that there are no changes in the frontend weights.

Will the gradients back-propagate through the modified_cpc frontend without any changes to the pipeline? If not, can you suggest a way to do that?
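Not from the thread itself, but as a self-contained sketch of how one might verify that a (sub)module receives gradients and actually gets updated, here is a toy example with an LSTM standing in for the S3PRL frontend; in a real run the snapshot would be taken from model.frontend of the built ESPnet2 model instead.

```python
import torch
import torch.nn as nn

# Toy "frontend" and "head"; in practice these would be model.frontend and the
# rest of the ESPnet2 model.
frontend = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)
head = nn.Linear(8, 2)
optimizer = torch.optim.Adam(
    list(frontend.parameters()) + list(head.parameters()), lr=1e-3
)

# Snapshot the frontend parameters before one training step.
before = {k: v.detach().clone() for k, v in frontend.named_parameters()}

x = torch.randn(4, 20, 8)        # (batch, time, feature)
feats, _ = frontend(x)
loss = head(feats).mean()
loss.backward()
optimizer.step()

# Report, per parameter, whether it moved and how large its gradient was.
for k, v in frontend.named_parameters():
    changed = not torch.allclose(before[k], v.detach())
    print(f"{k}: changed={changed}, grad_norm={v.grad.norm().item():.4f}")
```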

simpleoier commented 2 years ago

@tarunsaib1997 Do you have freeze_param in your config file and frontend is in the freeze_param?
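For reference, a minimal sketch of what freeze_param roughly amounts to (the module and parameter names here are only illustrative, not taken from the config above): the listed prefixes are matched against parameter names and the matching parameters get requires_grad=False, so the optimizer never updates them.

```python
import torch.nn as nn

# Toy model standing in for an ESPnet2 ASR model; names are illustrative only.
model = nn.ModuleDict({"frontend": nn.Linear(4, 4), "encoder": nn.Linear(4, 4)})

# Rough sketch of freeze_param behavior: prefix-match parameter names and
# turn off their gradients.
freeze_param = ["frontend"]
for prefix in freeze_param:
    for name, p in model.named_parameters():
        if name == prefix or name.startswith(prefix + "."):
            p.requires_grad = False  # excluded from gradient updates

print([(n, p.requires_grad) for n, p in model.named_parameters()])
```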

tarunsaib1997 commented 2 years ago

No, freeze_param was not used at all in the config file. Below is the entire conf file content used:

batch_type: folded     # numel
batch_bins: 5000000    # 140000000
batch_size: 8
accum_grad: 2          # 1
max_epoch: 100
patience: none
init: none
best_model_criterion:

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 15

decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: modified_cpc  # Note: If the upstream is changed, please change the input_size in the preencoder.
    download_dir: ./hub
    multilayer_feature: True

preencoder: none
preencoder_conf:
    input_size: 256  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 256

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:

simpleoier commented 2 years ago

Did you update to the latest espnet?

tarunsaib1997 commented 2 years ago

Will try pulling the latest repo and will let you know in a while, @simpleoier.

tarunsaib1997 commented 2 years ago

The latest code repo with the same conf gives the following error in the training stage:

File "/home/taruns/.conda/envs/esp/lib/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/taruns/.conda/envs/esp/lib/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/raid/home/taruns/espnet/espnet2/bin/asr_train.py", line 23, in main() File "/raid/home/taruns/espnet/espnet2/bin/asr_train.py", line 19, in main ASRTask.main(cmd=cmd) File "/raid/home/taruns/espnet/espnet2/tasks/abs_task.py", line 1019, in main cls.main_worker(args) File "/raid/home/taruns/espnet/espnet2/tasks/abs_task.py", line 1315, in main_worker cls.trainer.run( File "/raid/home/taruns/espnet/espnet2/train/trainer.py", line 286, in run all_steps_are_invalid = cls.train_one_epoch( File "/raid/home/taruns/espnet/espnet2/train/trainer.py", line 589, in train_one_epoch loss.backward() File "/home/taruns/.conda/envs/esp/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs) File "/home/taruns/.conda/envs/esp/lib/python3.9/site-packages/torch/autograd/init.py", line 154, in backward Variable._execution_engine.run_backward( RuntimeError: cudnn RNN backward can only be called in training mode

tarunsaib1997 commented 2 years ago

@simpleoier, any suggestions on how to rectify this?

tarunsaib1997 commented 2 years ago


The error above (RuntimeError: cudnn RNN backward can only be called in training mode) got solved by pulling the latest stable release. I am now checking whether the frontend weights update with this stable release.

tarunsaib1997 commented 2 years ago

@sw005320 @simpleoier I pulled the latest stable release but the front-end weights are not changing. Can you please help me out?

simpleoier commented 2 years ago

I'll test it.

tarunsaib1997 commented 2 years ago

Can you use this config?

batch_type: folded
batch_bins: 5000000
batch_size: 8
accum_grad: 2   # 1
max_epoch: 100
patience: none
init: none
best_model_criterion:

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 15

decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

optim: adam
optim_conf:
    lr: 0.0015
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: modified_cpc
    download_dir: ./hub
    multilayer_feature: False

preencoder: linear
preencoder_conf:
    input_size: 256
    output_size: 80

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:

simpleoier commented 2 years ago

Hi @tarunsaib1997, I found the reason for the previous error RuntimeError: cudnn RNN backward can only be called in training mode. It's because of this line, where we set the frontend to eval mode. We do this to avoid batch-norm / dropout effects in most cases, and Transformer-based upstreams work fine with it; however, LSTM-based ones may not.
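A minimal sketch (not from the thread) that reproduces this behavior outside ESPnet, assuming a CUDA device where the cuDNN LSTM kernel is used: calling backward through an nn.LSTM that is in eval mode raises the same error, while keeping it in training mode does not.

```python
import torch
import torch.nn as nn

# Minimal reproduction of the error above (requires a CUDA device).
lstm = nn.LSTM(input_size=8, hidden_size=8).cuda()
lstm.eval()  # same situation as the frontend being forced into eval mode

x = torch.randn(10, 2, 8, device="cuda", requires_grad=True)
out, _ = lstm(x)
try:
    out.sum().backward()
except RuntimeError as e:
    print(e)  # cudnn RNN backward can only be called in training mode

# With lstm.train() instead of lstm.eval(), the backward pass succeeds.
```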

After I removed that line, the error disappeared. I also checked the gradients of the frontend parameters; they are non-zero. Hope this helps.

(Pdb) p model.frontend.upstream.model.gAR.baseNet.weight_ih_l0.grad
tensor([[-0.0237, -0.0463, -0.0360,  ..., -0.0031, -0.1224, -0.0897],
        [-0.0189,  0.0596, -0.0175,  ..., -0.0410,  0.0529,  0.1328],
        [-0.0255, -0.0098,  0.0256,  ...,  0.0307, -0.1083, -0.0451],
        ...,
        [ 0.0102,  0.0164,  0.0138,  ...,  0.0110, -0.0148,  0.0320],
        [ 0.0188,  0.0066,  0.0119,  ...,  0.0150,  0.0090,  0.0128],
        [-0.0122, -0.0241, -0.0020,  ...,  0.0049, -0.0012, -0.0410]],
       device='cuda:0')

tarunsaib1997 commented 2 years ago

If I comment out that line and rerun the code, the gradients should back-propagate through the frontend, right? @simpleoier

simpleoier commented 2 years ago

Yes. In my case, the frontend parameters have non-zero gradients during back-prop, so the optimizer will update them.
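One way to double-check this after training (a sketch only; the paths are illustrative and it assumes the per-epoch {n}epoch.pth files under the experiment directory hold the model state_dict, as ESPnet2 typically saves them) is to diff the frontend tensors between two epoch checkpoints:

```python
import torch

# Hypothetical paths; replace with your own experiment directory.
ckpt1 = torch.load("exp/asr_train/1epoch.pth", map_location="cpu")
ckpt2 = torch.load("exp/asr_train/2epoch.pth", map_location="cpu")

# Compare every frontend tensor between the two checkpoints.
for key in ckpt1:
    if key.startswith("frontend."):
        same = torch.equal(ckpt1[key], ckpt2[key])
        print(key, "unchanged" if same else "updated")
```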

tarunsaib1997 commented 2 years ago

Thanks, this worked out.