Snowdar / asv-subtools

An Open Source Toolkit for Speaker Recognition
Apache License 2.0

the num_targets and the max label in train.egs.csv are not equal #53

Open JunLi0514 opened 1 year ago

JunLi0514 commented 1 year ago

Hi, I tried to run the CNCeleb recipe, but a RuntimeError appeared:

#### Training will run for 6 epochs.
Traceback (most recent call last):
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/training/trainer.py", line 283, in run
    loss, acc = self.train_one_batch(batch)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/training/trainer.py", line 182, in train_one_batch
    loss = model.get_loss(model_forward(inputs), targets)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/support/utils.py", line 157, in wrapper
    return function(self, *transformed)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/exp/SEResnet34_am_train_fbank40/config/resnet-se-xvector.py", line 559, in get_loss
    return self.loss(inputs, targets)
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/kaldi/egs/xmuspeech/sre/subtools/pytorch/libs/nnet/loss.py", line 360, in forward
    return self.loss_function(outputs/self.t, targets) + self.ring_loss * ring_loss
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/ubuntu/miniconda3/envs/subtools/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

That means the number of output classes of the FC classifier is smaller than some labels. I find that num_targets in exp/egs/train_sequential/info is 2687, while the max label in train.egs.csv is 2711. Could you please tell me which script generates exp/egs/train_sequential/info/num_targets?
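A quick way to confirm this kind of mismatch before training is to compare num_targets against the largest label in the egs CSV. The sketch below is a minimal, stdlib-only check; the column name "class-label" is an assumption and should be replaced with the actual label column used in train.egs.csv.

```python
import csv
import io

def max_label(csv_text, label_col="class-label"):
    """Return the largest integer label in an egs CSV.

    label_col is an assumption -- substitute the real label
    column name from train.egs.csv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return max(int(row[label_col]) for row in reader)

def targets_ok(num_targets, csv_text, label_col="class-label"):
    """Cross-entropy requires every label to lie in [0, num_targets);
    a label >= num_targets triggers the device-side index assert above.
    Returns (is_valid, max_label_found)."""
    m = max_label(csv_text, label_col)
    return m < num_targets, m

# Toy example mirroring the reported numbers:
csv_text = "utt,class-label\nu1,0\nu2,2711\n"
ok, m = targets_ok(2687, csv_text)
# ok is False here: 2711 >= 2687, so training would hit the assert.
```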

JunLi0514 commented 1 year ago

Thanks to syousen: for offline egs, get_chunk_egs() in subtools/pytorch/pipeline/onestep/get_chunk_egs.py generates the num_targets file. I found that after the train/val split, the num_spks of the train set shrinks, because a speaker is removed entirely when all of its utterances are assigned to the val set. However, this has no effect on the utt2spk label mapping utt2spk_int, because it is not updated in filter() of KaldiDataset in subtools/pytorch/libs/egs/kaldi_dataset.py; only the attributes belonging to 'utt_first_files' and 'spk_first_files' are changed in the filter() function. So I recommend that subtools/pytorch/pipeline/onestep/get_chunk_egs.py use dataset.num_spks instead of trainset.num_spks to generate the */info/num_targets file.
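The mechanism described above can be reproduced in a few lines. This is a toy sketch, not the actual KaldiDataset API: it splits utterances into train/val without remapping the integer labels, so the train set's speaker count drops while the labels keep their original range.

```python
def split_without_remap(utts, utt2spk_int, val_utts):
    """Toy reproduction of the bug: filter utterances into a train
    subset but leave utt2spk_int untouched (as filter() does).

    Returns (num_train_speakers, max_train_label). If a speaker's
    utterances all land in val, num_train_speakers shrinks but the
    surviving labels are NOT renumbered, so max_train_label can be
    >= num_train_speakers -- exactly the out-of-range target that
    trips cross-entropy on GPU.
    """
    train = [u for u in utts if u not in val_utts]
    train_speakers = {utt2spk_int[u] for u in train}
    max_train_label = max(utt2spk_int[u] for u in train)
    return len(train_speakers), max_train_label

# Three speakers 0/1/2; speaker 1's only utterance goes to val:
utts = ["u0", "u1", "u2"]
utt2spk_int = {"u0": 0, "u1": 1, "u2": 2}
n_spks, max_lbl = split_without_remap(utts, utt2spk_int, {"u1"})
# n_spks == 2 but max_lbl == 2: a classifier sized from the filtered
# train set (2 outputs) would receive target 2, which is out of range.
```

Sizing the classifier from the unfiltered dataset's speaker count (the dataset.num_spks suggestion above) avoids this, since the labels were assigned against the full speaker set.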