YuanGongND / whisper-at

Code and Pretrained Models for Interspeech 2023 Paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers"
BSD 2-Clause "Simplified" License

invalid for input of size 95904000? #18

Closed herbiel closed 8 months ago

herbiel commented 8 months ago

In the train stage, it still fails:

File "/opt/whisper/whisper-at/src/whisper_at_train/models.py", line 172, in forward
    audio_rep = audio_rep.reshape(B*self.n_layer, audio_rep.shape[2], audio_rep.shape[3])  # [B*32, 25, 1280]
RuntimeError: shape '[192, 25, 80]' is invalid for input of size 95904000
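A quick arithmetic check, using only the numbers in the traceback above, shows why this reshape cannot succeed: the requested shape holds far fewer elements than the tensor actually contains.

```python
# reshape() requires the target shape to describe exactly as many elements as the input tensor.
print(192 * 25 * 80)  # 384000: what shape [192, 25, 80] can hold
print(95904000)       # the actual element count reported in the error
# 384000 != 95904000, so the tensor cannot have the assumed [B, 32, 25, 1280] layout.
```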

YuanGongND commented 8 months ago

Check your audio_rep shape. Is your audio 16 kHz?

herbiel commented 8 months ago

Yes, I used the file command in Linux, and it shows: 01.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz

YuanGongND commented 8 months ago

Can you check the shape of audio_rep by audio_rep.shape?

herbiel commented 8 months ago

How do I check that?

YuanGongND commented 8 months ago

put a print command before the line that reports the error.

herbiel commented 8 months ago

Like this; the result is 48.

YuanGongND commented 8 months ago

I apologize for that, but I do not have bandwidth for basics (and that is not related to this project).

You should do print(audio_rep.shape), not print(B); B is just the batch size. The error message tells you that the input shape has a problem, so you should check that.
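A runnable stand-in showing where that print goes relative to the failing reshape, assuming the [B, 32, 25, 1280] layout the default training code expects (dummy data and names, not the repo's actual models.py):

```python
import torch

n_layer = 32                                   # number of Whisper layers the training code expects
audio_rep = torch.randn(4, n_layer, 25, 1280)  # dummy batch with the expected feature layout
print("audio_rep shape:", audio_rep.shape)     # the debug print suggested above
B = audio_rep.shape[0]
audio_rep = audio_rep.reshape(B * n_layer, audio_rep.shape[2], audio_rep.shape[3])
print("after reshape:  ", audio_rep.shape)     # torch.Size([128, 25, 1280]), i.e. [B*32, 25, 1280]
```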

herbiel commented 8 months ago

This is the new result: test shape torch.Size([48, 999, 25, 80])

YuanGongND commented 8 months ago

Your extracted Whisper feature is not in the correct shape; it should be [48, 32, 25, 1280]. You need to debug the feature extraction.

Sorry, I won't be able to help with such specific debugging.
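A minimal sanity check one could add after loading the features, assuming standard PyTorch tensors (the variable name and dummy data are illustrative, not the repo's API):

```python
import torch

audio_rep = torch.randn(48, 32, 25, 1280)  # replace with the features actually loaded for training
expected_layers, expected_dim = 32, 1280   # whisper large-v1: 32 encoder layers, 1280-dim states
assert audio_rep.shape[1] == expected_layers and audio_rep.shape[3] == expected_dim, (
    f"unexpected feature shape {tuple(audio_rep.shape)}; re-extract with the large-v1 model"
)
```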

herbiel commented 8 months ago

I ran sh extract_as_full_whisper_all.sh 0 from noise_robust_asr/intermediate_feat_extract/as_full. Is this the right way to extract the Whisper features?

YuanGongND commented 8 months ago

https://github.com/YuanGongND/whisper-at/blob/01b01d63a79334f49e738eb1e77b1429653dd71e/src/noise_robust_asr/intermediate_feat_extract/whisper_feat_extracrt/whisper/transcribe.py#L51-L52

You should take the second return value, not the first; this may be the problem.
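A sketch of what that looks like at the call site, assuming transcribe_audio (the call shown later in this thread) returns a tuple whose second element is the intermediate features:

```python
# mdl and wav come from the extraction script; the point is only which tuple element to keep.
result = mdl.transcribe_audio(wav)  # returns a tuple, per the linked transcribe.py lines
audio_rep = result[1]               # second return value: the intermediate representation
print("extracted feature shape:", audio_rep.shape)
```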

YuanGongND commented 8 months ago

Print the shape of audio_rep after each line of these: https://github.com/YuanGongND/whisper-at/blob/01b01d63a79334f49e738eb1e77b1429653dd71e/src/noise_robust_asr/intermediate_feat_extract/as_full/extract_as_full_whisper_all.py#L34-L43

herbiel commented 8 months ago

It's like this:


result for test ----- (tensor([[[-0.2349, -0.2363, -0.2385, ..., 0.0000, 0.0000, 0.0000], [-0.5088, -0.5103, -0.5127, ..., 0.0000, 0.0000, 0.0000], [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000], ..., [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000], [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000], [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000]]], device='cuda:0', dtype=torch.float16), tensor([[[[-1.3794e-01, 1.5405e-01, -2.5732e-01, -1.2549e+00, -5.5029e-01], [-6.8307e-05, 8.9062e-01, 7.8125e-01, 4.6875e-01, -9.3164e-01], [-2.2873e-02, 6.2500e-01, 5.4785e-01, -2.2461e-02, 4.3242e+00], ..., [ 1.0000e+00, 6.7529e-01, 6.7529e-01, 6.8164e-01, 3.2578e+00], [ 9.9902e-01, 8.1152e-01, 8.9746e-01, 1.1963e+00, 6.4922e+00], [ 9.7559e-01, 1.1602e+00, 1.0117e+00, 1.5303e+00, 5.1445e+00]],

     [[ 6.7139e-01,  6.8799e-01,  5.6592e-01,  5.7275e-01,  7.3682e-01],
      [ 8.1494e-01,  6.3916e-01,  7.5586e-01,  6.8652e-01,  1.6221e+00],
      [ 6.2744e-01,  6.7725e-01,  8.5059e-01,  1.0605e+00,  2.0586e+00],
      ...,
      [ 8.3984e-01,  8.5938e-01,  7.8906e-01,  8.9795e-01,  1.0957e+00],
      [ 8.4961e-01,  8.6865e-01,  9.9023e-01,  8.4375e-01,  1.4121e+00],
      [ 8.3008e-01,  7.4463e-01,  6.7480e-01,  1.2559e+00,  2.1641e+00]],

     [[ 7.3926e-01,  5.4688e-01,  1.8628e-01,  2.0117e-01,  3.5498e-01],
      [ 9.4434e-01,  7.3047e-01,  8.3838e-01,  9.0918e-01,  2.3750e+00],
      [ 8.0957e-01,  7.1582e-01,  8.6523e-01,  1.0615e+00,  2.3125e+00],
      ...,
      [ 8.3984e-01,  7.0947e-01,  7.1436e-01,  8.8428e-01,  9.6777e-01],
      [ 8.4863e-01,  7.6709e-01,  6.5234e-01,  5.5908e-01,  9.7168e-01],
      [ 8.3008e-01,  7.5244e-01,  5.8154e-01,  1.3457e+00,  2.1016e+00]],

     ...,

     [[ 6.1621e-01,  2.8564e-01,  3.0029e-01,  1.7029e-01,  2.6074e-01],
      [ 7.0117e-01,  3.3252e-01,  4.6875e-01, -2.6489e-01, -1.4170e+00],
      [-9.7705e-01,  3.1201e-01,  2.7588e-01,  1.8726e-01,  2.8711e-01],
      ...,
      [ 8.7939e-01,  3.3887e-01,  2.2229e-01,  4.9805e-01,  7.6904e-01],
      [ 8.3105e-01,  4.8584e-01,  6.2305e-01,  5.2393e-01,  1.2402e+00],
      [ 8.2910e-01,  3.0103e-01,  1.8726e-01,  3.1909e-01,  1.9849e-01]],

     [[ 1.0264e+00,  3.1592e-01,  2.9541e-01,  2.1558e-01, -1.6772e-01],
      [-1.7444e-01,  2.5464e-01,  3.8184e-01, -4.3896e-01, -1.7061e+00],
      [-2.6758e-01,  3.9087e-01,  3.9624e-01,  3.2983e-01,  3.4814e-01],
      ...,
      [ 8.7939e-01,  3.5718e-01,  2.1436e-01,  3.6670e-01,  9.4434e-01],
      [ 8.3105e-01,  4.6973e-01,  6.0693e-01,  5.2002e-01,  1.4033e+00],
      [ 8.2910e-01,  3.0078e-01,  1.6309e-01,  2.7197e-01,  8.0957e-01]],

     [[ 5.1904e-01,  3.7292e-02, -7.5684e-03,  8.0566e-02, -2.5928e-01],
      [-9.0381e-01,  4.7363e-02,  1.8872e-01, -7.0068e-01, -2.0137e+00],
      [ 5.7568e-01,  6.1084e-01,  5.5566e-01,  5.1855e-01,  8.7012e-01],
      ...,
      [ 8.7939e-01,  3.6328e-01,  2.3169e-01, -3.2471e-02,  8.7354e-01],
      [ 8.3105e-01,  4.8071e-01,  6.9775e-01,  4.6118e-01,  1.0801e+00],
      [ 8.2910e-01,  3.5229e-01,  2.2546e-01,  4.4678e-02,  1.0332e+00]]]],
   device='cuda:0', dtype=torch.float16, grad_fn=<StackBackward0>))
YuanGongND commented 8 months ago

please print the shape, not the exact tensor, and please print it after every line

herbiel commented 8 months ago

log.txt — I have printed them now; I changed the code as shown.


The attached log.txt is the output file.

YuanGongND commented 8 months ago

please please do audio_rep.shape not audio_rep

herbiel commented 8 months ago

this one?

YuanGongND commented 8 months ago

could you please change all lines to print the shape rather than the actual tensor?

herbiel commented 8 months ago

I have changed the code to _, audio_rep = mdl.transcribe_audio(wav) and added print(audio_rep.shape), but audio_rep.shape is the same? The code is now:

    audio_rep = mdl.transcribe_audio(wav)
    print("output audio shape for " + wav)
    audio_rep = audio_rep[0]
    print(audio_rep.shape)

YuanGongND commented 8 months ago

The shape is correct for a tiny model (and is different from your previous shape).

> This is the new result: test shape torch.Size([48, 999, 25, 80])

I assume you changed the model size to tiny. The default training code requires the large-v1 Whisper model in the feature extraction step; the size should be [500, 1280 (embed_dim), 32 (num_layer)] at the same printing point. It is just a shape issue, so you will need to do some debugging. I cannot provide further support for this.

-Yuan
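For context, the encoder width and layer count differ between Whisper sizes, which is why features extracted with tiny cannot be fed to a model configured for large-v1. A small reference sketch (standard Whisper encoder dimensions, stated for orientation rather than taken from the repo's config files):

```python
# Whisper encoder dimensions relevant to this thread.
whisper_dims = {
    "tiny":     {"n_layer": 4,  "embed_dim": 384},   # features look like [B, 4, 25, 384]
    "large-v1": {"n_layer": 32, "embed_dim": 1280},  # features look like [B, 32, 25, 1280]
}
print(whisper_dims["tiny"], whisper_dims["large-v1"])
```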

herbiel commented 8 months ago

Yes, thanks for your help, but this problem has troubled me for a long time and I am looking forward to solving it. In extract_as_full_whisper_all.py I changed the model to tiny (as_full/extract_as_full_whisper_all.py:46: mdl_size_list = ['tiny'] # , 'large-v1', 'medium.en'), so do I also need to change model=whisper-high-lw_tr_1_8 in whisper_at_train/run_as_full_train.sh to tiny?

herbiel commented 8 months ago

I think the shape is normal now, but there is another problem:

The learning rate scheduler starts at 15 epoch with decay rate of 0.750 every 5 epoches
now training with as-full, main metrics: mAP, loss function: BCEWithLogitsLoss(), learning rate scheduler: <torch.optim.lr_scheduler.MultiStepLR object at 0x7f59f299ea70>
current #steps=0, #epochs=1
start training...

2024-01-14 10:52:32.100556
current #epochs=1, #steps=0
test shape torch.Size([48, 4, 25, 384])
start validation
[]
Traceback (most recent call last):
  File "/opt/whisper/whisper-at/src/whisper_at_train/./run.py", line 155, in <module>
    train(audio_model, train_loader, val_loader, args)
  File "/opt/whisper/whisper-at/src/whisper_at_train/traintest.py", line 137, in train
    stats, valid_loss = validate(audio_model, test_loader, args)
  File "/opt/whisper/whisper-at/src/whisper_at_train/traintest.py", line 232, in validate
    audio_output = torch.cat(A_predictions)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
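One thing worth checking for that last error, assuming the loaders are standard PyTorch DataLoaders: "expected a non-empty list of Tensors" in validate() usually means the loop over the evaluation loader never produced any predictions, i.e. the evaluation set is empty or failed to load.

```python
# Hypothetical check before calling train()/validate(); loader names follow the traceback above.
print("train batches:", len(train_loader))
print("val batches:  ", len(val_loader))  # if this prints 0, torch.cat() will receive an empty list
```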