1ytic / warp-rnnt

CUDA-Warp RNN-Transducer
MIT License

Strange behavior using PyTorch DDP #32

Open snakers4 opened 2 years ago

snakers4 commented 2 years ago

@1ytic Hi,

So far I have been able to use the loss with DDP on a single GPU, and it behaves more or less as expected.

But when I use more than 1 device, the following happens:

I checked the input tensors, devices, tensor values, etc., and so far everything seems to be identical for GPU 0 and the other GPUs.
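
A minimal sketch of the kind of per-rank check described above (the helper and variable names are illustrative, not from the actual training code):

```python
import torch.distributed as dist

# Minimal sketch of a per-rank sanity check: print what each DDP rank
# feeds into the RNN-T loss and what it gets back (names are illustrative).
def debug_rnnt_inputs(log_probs, labels, frames_lengths, labels_lengths, loss):
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(
        f"rank={rank} "
        f"log_probs={tuple(log_probs.shape)} on {log_probs.device} "
        f"labels={tuple(labels.shape)} on {labels.device} "
        f"frames_lengths={frames_lengths.tolist()} "
        f"labels_lengths={labels_lengths.tolist()} "
        f"loss={loss.detach().float().item():.4f}"
    )
```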

snakers4 commented 2 years ago

@burchim By the way, since you used this loss, did you encounter anything of this sort in your work?

burchim commented 2 years ago

Hi @snakers4! Yes, I had a similar problem with 4 GPU devices, where the RNN-T loss was properly computed on the first device but was 0 on the others. I don't really remember the exact cause, but it had something to do with tensor devices, maybe the frame / label lengths.
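
A minimal sketch of the kind of device fix implied here, assuming the signature of the warp_rnnt Python binding's rnnt_loss (worth double-checking against the installed version); the idea is to move the label and length tensors onto the same device as log_probs before calling the loss:

```python
from warp_rnnt import rnnt_loss

# Sketch of one possible fix for the multi-GPU symptom described above:
# make sure labels and length tensors live on the same device as log_probs.
# Keyword names follow the warp_rnnt binding, but verify against your version.
def compute_loss(log_probs, labels, frames_lengths, labels_lengths):
    device = log_probs.device
    labels = labels.to(device)
    frames_lengths = frames_lengths.to(device)
    labels_lengths = labels_lengths.to(device)
    return rnnt_loss(
        log_probs,       # (N, T, U+1, V), log-softmax over the vocabulary
        labels,          # (N, U), int
        frames_lengths,  # (N,), int
        labels_lengths,  # (N,), int
        reduction="mean",
        blank=0,
    )
```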

I also recently experimented with replacing it with the official torchaudio.transforms.RNNTLoss from torchaudio 0.10.0. It was working very well, but I didn't try a full training with it.
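
A rough sketch of that torchaudio replacement (torchaudio >= 0.10.0); note that torchaudio.transforms.RNNTLoss consumes the raw joiner logits plus int32 targets and lengths, so the exact contract is worth re-checking in the docs for the installed version:

```python
import torch
import torchaudio

# Sketch of a drop-in torchaudio RNNTLoss call with dummy shapes.
rnnt_loss = torchaudio.transforms.RNNTLoss(blank=0, reduction="mean")

N, T, U, V = 2, 50, 10, 32                       # batch, frames, target length, vocab
logits = torch.randn(N, T, U + 1, V, requires_grad=True)   # raw joiner output
targets = torch.randint(1, V, (N, U), dtype=torch.int32)   # label indices (no blank)
logit_lengths = torch.full((N,), T, dtype=torch.int32)
target_lengths = torch.full((N,), U, dtype=torch.int32)

loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
loss.backward()
```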

snakers4 commented 2 years ago

Thanks for the heads up about the torchaudio loss! I remember seeing it some time ago, but I had totally forgotten about it.

snakers4 commented 2 years ago

@burchim By the way, did you get RuntimeError: input length mismatch when migrating from warp-rnnt to torchaudio?

burchim commented 2 years ago

Yes, this means that the logit / target length tensors do not match the logits / targets tensors, for instance if your logit lengths are longer than the time dimension of your logits tensor.

burchim commented 2 years ago

Because I used the target lengths instead of the logit lengths, a stupid error.
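
For reference, a small sketch of the length contract behind that error, with the mix-up spelled out in a comment (tensor names are illustrative; argument order as in torchaudio.transforms.RNNTLoss):

```python
import torch

# Sketch of the length contract behind "input length mismatch"
# (shapes as expected by torchaudio's RNNTLoss):
#   logits:         (N, T, U + 1, V)
#   targets:        (N, U)   int32
#   logit_lengths:  (N,)     int32, max(logit_lengths)  == T
#   target_lengths: (N,)     int32, max(target_lengths) == U
def check_rnnt_lengths(logits: torch.Tensor, targets: torch.Tensor,
                       logit_lengths: torch.Tensor, target_lengths: torch.Tensor):
    N, T, U_plus_1, V = logits.shape
    assert targets.shape == (N, U_plus_1 - 1)
    assert int(logit_lengths.max()) == T, "input length mismatch"
    assert int(target_lengths.max()) == U_plus_1 - 1, "output length mismatch"

# The mix-up above, in terms of argument order:
#   rnnt_loss(logits, targets, target_lengths, target_lengths)  # wrong
#   rnnt_loss(logits, targets, logit_lengths,  target_lengths)  # correct
```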

csukuangfj commented 2 years ago

> Thanks for the heads up about the torchaudio loss!

@snakers4 You may find https://github.com/danpovey/fast_rnnt useful.