ZhikangNiu / encodec-pytorch

Unofficial PyTorch implementation of "High Fidelity Neural Audio Compression" (EnCodec)
MIT License

Problems with model performance #26

Closed · zhangchi2004 closed this issue 1 month ago

zhangchi2004 commented 1 month ago

I am attempting to train the EnCodec model on a 16 kHz dataset of about 50,000 waveforms, using 8 GPUs across 2 machines, with tensor_cut = 65536, batch_size = 32 (per GPU), ratios = [8,5,4,4], and lr = 5e-5 (other configs left at their defaults). The loss converges to roughly the following values:

2024-07-13 17:12:10,985: INFO: [train_with_torchrun.py: 146]: Epoch 100 120/120 Avg loss_G: 8.3933  Avg losses_G: l_t: 0.0886   l_f: 5.9251 l_g: 0.5467 l_feat: 0.2731  Avg loss_W: 0.1236  lr_G: 5.306871e-06  lr_D: 5.306871e-06  loss_disc: 1.8721
2024-07-13 17:12:39,658: INFO: [train_with_torchrun.py: 165]: | TEST | epoch: 100 | loss_g: 6.761796489357948 | loss_disc: 1.8528

But the reconstructed audio is horrible. Do these loss values indicate that the model has properly fit the dataset? Also, are there any issues with my config, or other considerations when setting it? Thank you!

zhangchi2004 commented 1 month ago

Meanwhile, I wonder how the EMA update of the codebook is synchronized across multiple GPUs. I haven't found any relevant part of the code.

ZhikangNiu commented 1 month ago

I see. I think you should increase your learning rate (to 3e-4 or 1e-3). Also, could you tell me why you changed the ratios from [8,5,4,2] to [8,5,4,4]? That gives the encoder a 640x downsampling rate, which could lose a lot of audio information.
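For reference, the encoder's total downsampling factor is just the product of the strides in `ratios`, so the frame rate is easy to check. A small sketch, assuming the 16 kHz sample rate from the question above:

```python
import math

sample_rate = 16_000  # Hz, the dataset's sample rate from the question

# Each ratio is a convolutional stride in the encoder, so the total
# hop size (downsampling factor) is the product of the ratios.
for ratios in ([8, 5, 4, 2], [8, 5, 4, 4]):
    hop = math.prod(ratios)
    frame_rate = sample_rate / hop
    print(ratios, hop, frame_rate)  # [8,5,4,2] -> 320x, 50.0 frames/s
                                    # [8,5,4,4] -> 640x, 25.0 frames/s
```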

ZhikangNiu commented 1 month ago

If you have 50,000 waveforms, maybe it is overfitting?

zhangchi2004 commented 1 month ago

Thank you. I have tried 3e-4 and am waiting for the result. I changed the ratios in order to compare EnCodec with other compression methods at an equal compression rate.
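For matching compression rates, the bitrate follows from the frame rate times bits per frame. A back-of-the-envelope sketch, assuming (hypothetically — neither value is stated in this thread) a codebook size of 1024 and 8 residual quantizers:

```python
import math

sample_rate = 16_000
hop = 8 * 5 * 4 * 4             # ratios [8, 5, 4, 4] -> 640x downsampling
frame_rate = sample_rate / hop  # 25 frames per second

bins = 1024  # codebook size (assumed; gives 10 bits per code)
n_q = 8      # number of residual quantizers (illustrative assumption)

bitrate = frame_rate * n_q * math.log2(bins)
print(bitrate)  # 2000.0 bits/s, i.e. 2 kbps under these assumptions
```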

ZhikangNiu commented 1 month ago

Yeah, maybe the compression rate is so large that it degrades the model's performance. BTW, for the multi-GPU codebook update you can check https://github.com/ZhikangNiu/encodec-pytorch/blob/c6b6de91c4bfeb8582b5e51f1a8b599e04b7d860/quantization/core_vq.py#L157

zhangchi2004 commented 1 month ago

I have two questions regarding this. First, these two lines are commented out in the original code; why doesn't that affect performance? Second, the buffers are only broadcast at initialization and after code expiration, but during the normal EMA update (self.ema_inplace) I don't see any broadcast. As I understand it, the codebook vectors will then move differently on different GPUs because each sees different data.
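A toy illustration of the second point (a pure-Python sketch, not the repo's actual code): if each replica applies the EMA update `new = decay * old + (1 - decay) * batch_stat` to its own batch statistics and nothing is ever averaged or broadcast, codebooks that start identical drift apart:

```python
decay = 0.99  # illustrative EMA decay

def ema_inplace(value: float, new: float) -> float:
    # Standard EMA update rule applied to one codebook statistic.
    return decay * value + (1 - decay) * new

code_gpu0 = code_gpu1 = 0.5   # identical after the initial broadcast
data_gpu0 = [0.1, 0.2, 0.3]   # each rank sees a different data shard
data_gpu1 = [0.9, 0.8, 0.7]

for x0, x1 in zip(data_gpu0, data_gpu1):
    code_gpu0 = ema_inplace(code_gpu0, x0)
    code_gpu1 = ema_inplace(code_gpu1, x1)

print(abs(code_gpu0 - code_gpu1))  # nonzero: the replicas have diverged
```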

ZhikangNiu commented 1 month ago


  1. I think it actually uses the GPU 0 codebook.
  2. I think your understanding is correct. BTW, you can double-check the codebook weights on different GPUs during training.
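One common way to keep the replicas in sync (an assumption about a possible fix, not what this repo does) is to all-reduce the per-batch statistics across ranks before the EMA update, so every replica applies the same update. Simulated here without torch.distributed for clarity:

```python
decay = 0.99  # illustrative EMA decay

def ema_inplace(value: float, new: float) -> float:
    return decay * value + (1 - decay) * new

codes = [0.5, 0.5]                           # one codebook entry per "rank"
shards = [[0.1, 0.4, 0.2], [0.3, 0.8, 0.6]]  # per-rank batch statistics

for step in range(len(shards[0])):
    # Stand-in for dist.all_reduce(batch_stat, op=ReduceOp.AVG):
    avg_stat = sum(shard[step] for shard in shards) / len(shards)
    codes = [ema_inplace(c, avg_stat) for c in codes]

print(codes[0] == codes[1])  # True: replicas apply identical updates
```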