herbiel closed this 8 months ago.
Check your audio_rep shape. Is your audio 16 kHz?
Yes, I checked with the file command in Linux; file 01.wav shows: 01.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
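The same header check can be scripted with Python's standard-library wave module, which reads the sample rate, channel count and sample width of an uncompressed PCM WAV file. This is a minimal sketch: the function name check_wav and its 16 kHz / mono / 16-bit defaults are assumptions chosen to mirror the file output above, not anything from whisper-at.

```python
import wave


def check_wav(path, rate=16000, channels=1, sample_width_bytes=2):
    """Return a list of header problems for a PCM WAV file (empty list = OK).

    The defaults (16 kHz, mono, 16-bit) are assumptions that mirror the
    `file` output quoted above; adjust them to what your pipeline expects.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sample rate {wf.getframerate()} Hz, expected {rate} Hz")
        if wf.getnchannels() != channels:
            problems.append(f"{wf.getnchannels()} channels, expected {channels}")
        if wf.getsampwidth() != sample_width_bytes:
            problems.append(
                f"{8 * wf.getsampwidth()}-bit samples, "
                f"expected {8 * sample_width_bytes}-bit"
            )
    return problems
```

An empty list, e.g. from print(check_wav("01.wav")), means the header matches. This only inspects the file header; it says nothing about the feature shapes produced later.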
Can you check the shape of audio_rep with audio_rep.shape?
How do I check it?
Put a print statement before the line that raises the error.
Like this? The result is 48.
I apologize, but I do not have bandwidth for the basics (and that is not related to this project).
You should do print(audio_rep.shape), not print(B); B is just the batch size. The error message tells you that the input shape has a problem, so you should check that.
This is the new result: test shape torch.Size([48, 999, 25, 80])
Your extracted Whisper feature does not have the correct shape; it should be [48, 32, 25, 1280]. You need to debug the feature extraction.
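Comparing the two shapes dimension by dimension makes it obvious which axes are off. A small sketch using plain tuples; since torch.Size is a subclass of tuple, a printed shape can be passed in directly. The helper name shape_mismatches is hypothetical, and None serves as a wildcard for the batch size.

```python
def shape_mismatches(actual, expected):
    """List (dim, actual, expected) for every dimension that differs.

    `expected` may contain None as a wildcard (e.g. for the batch size).
    A rank mismatch is reported as a single entry with dim=None.
    """
    actual, expected = tuple(actual), tuple(expected)
    if len(actual) != len(expected):
        return [(None, len(actual), len(expected))]
    return [
        (dim, a, e)
        for dim, (a, e) in enumerate(zip(actual, expected))
        if e is not None and a != e
    ]


# Shapes quoted in this thread: got [48, 999, 25, 80], expected [48, 32, 25, 1280].
print(shape_mismatches((48, 999, 25, 80), (None, 32, 25, 1280)))
# [(1, 999, 32), (3, 80, 1280)]
```

Dimension 1 holds the number of Whisper layers and dimension 3 the embedding width, so both mismatches point back at the feature extraction rather than the training code.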
Sorry, I won't be able to help with such specific debugging.
noise_robust_asr/intermediate_feat_extract/as_full# sh extract_as_full_whisper_all.sh 0: is this the right way to extract the Whisper features?
You should take the second return value, not the first; this may be the problem.
Print the shape of audio_rep after each of these lines: https://github.com/YuanGongND/whisper-at/blob/01b01d63a79334f49e738eb1e77b1429653dd71e/src/noise_robust_asr/intermediate_feat_extract/as_full/extract_as_full_whisper_all.py#L34-L43
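Which element of a returned tuple ends up in audio_rep depends on how the call is unpacked. The sketch below uses a hypothetical stand-in, fake_transcribe_audio, in place of the real mdl.transcribe_audio (strings replace the tensors so it runs without torch); it only illustrates the Python semantics, and which element holds the features should be confirmed in the extraction script.

```python
def fake_transcribe_audio(wav):
    """Hypothetical stand-in for mdl.transcribe_audio that returns a 2-tuple.

    Strings replace the real tensors so the example runs without torch.
    """
    return ("first return value", f"second return value (features for {wav})")


# Indexing with [0] keeps the FIRST element of the tuple.
first = fake_transcribe_audio("01.wav")[0]

# Tuple unpacking discards the first element and keeps the SECOND.
_, audio_rep = fake_transcribe_audio("01.wav")

# Equivalent explicit form of the line above.
audio_rep_alt = fake_transcribe_audio("01.wav")[1]

print(first)
print(audio_rep)
```

In other words, audio_rep = mdl.transcribe_audio(wav) followed by audio_rep = audio_rep[0] keeps the first element, while _, audio_rep = mdl.transcribe_audio(wav) keeps the second.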
It's like this:
result for test ----- (tensor([[[-0.2349, -0.2363, -0.2385, ..., 0.0000, 0.0000, 0.0000], [-0.5088, -0.5103, -0.5127, ..., 0.0000, 0.0000, 0.0000], [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000], ..., [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000], [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000], [-1.0908, -1.0908, -1.0908, ..., 0.0000, 0.0000, 0.0000]]], device='cuda:0', dtype=torch.float16), tensor([[[[-1.3794e-01, 1.5405e-01, -2.5732e-01, -1.2549e+00, -5.5029e-01], [-6.8307e-05, 8.9062e-01, 7.8125e-01, 4.6875e-01, -9.3164e-01], [-2.2873e-02, 6.2500e-01, 5.4785e-01, -2.2461e-02, 4.3242e+00], ..., [ 1.0000e+00, 6.7529e-01, 6.7529e-01, 6.8164e-01, 3.2578e+00], [ 9.9902e-01, 8.1152e-01, 8.9746e-01, 1.1963e+00, 6.4922e+00], [ 9.7559e-01, 1.1602e+00, 1.0117e+00, 1.5303e+00, 5.1445e+00]],
[[ 6.7139e-01, 6.8799e-01, 5.6592e-01, 5.7275e-01, 7.3682e-01],
[ 8.1494e-01, 6.3916e-01, 7.5586e-01, 6.8652e-01, 1.6221e+00],
[ 6.2744e-01, 6.7725e-01, 8.5059e-01, 1.0605e+00, 2.0586e+00],
...,
[ 8.3984e-01, 8.5938e-01, 7.8906e-01, 8.9795e-01, 1.0957e+00],
[ 8.4961e-01, 8.6865e-01, 9.9023e-01, 8.4375e-01, 1.4121e+00],
[ 8.3008e-01, 7.4463e-01, 6.7480e-01, 1.2559e+00, 2.1641e+00]],
[[ 7.3926e-01, 5.4688e-01, 1.8628e-01, 2.0117e-01, 3.5498e-01],
[ 9.4434e-01, 7.3047e-01, 8.3838e-01, 9.0918e-01, 2.3750e+00],
[ 8.0957e-01, 7.1582e-01, 8.6523e-01, 1.0615e+00, 2.3125e+00],
...,
[ 8.3984e-01, 7.0947e-01, 7.1436e-01, 8.8428e-01, 9.6777e-01],
[ 8.4863e-01, 7.6709e-01, 6.5234e-01, 5.5908e-01, 9.7168e-01],
[ 8.3008e-01, 7.5244e-01, 5.8154e-01, 1.3457e+00, 2.1016e+00]],
...,
[[ 6.1621e-01, 2.8564e-01, 3.0029e-01, 1.7029e-01, 2.6074e-01],
[ 7.0117e-01, 3.3252e-01, 4.6875e-01, -2.6489e-01, -1.4170e+00],
[-9.7705e-01, 3.1201e-01, 2.7588e-01, 1.8726e-01, 2.8711e-01],
...,
[ 8.7939e-01, 3.3887e-01, 2.2229e-01, 4.9805e-01, 7.6904e-01],
[ 8.3105e-01, 4.8584e-01, 6.2305e-01, 5.2393e-01, 1.2402e+00],
[ 8.2910e-01, 3.0103e-01, 1.8726e-01, 3.1909e-01, 1.9849e-01]],
[[ 1.0264e+00, 3.1592e-01, 2.9541e-01, 2.1558e-01, -1.6772e-01],
[-1.7444e-01, 2.5464e-01, 3.8184e-01, -4.3896e-01, -1.7061e+00],
[-2.6758e-01, 3.9087e-01, 3.9624e-01, 3.2983e-01, 3.4814e-01],
...,
[ 8.7939e-01, 3.5718e-01, 2.1436e-01, 3.6670e-01, 9.4434e-01],
[ 8.3105e-01, 4.6973e-01, 6.0693e-01, 5.2002e-01, 1.4033e+00],
[ 8.2910e-01, 3.0078e-01, 1.6309e-01, 2.7197e-01, 8.0957e-01]],
[[ 5.1904e-01, 3.7292e-02, -7.5684e-03, 8.0566e-02, -2.5928e-01],
[-9.0381e-01, 4.7363e-02, 1.8872e-01, -7.0068e-01, -2.0137e+00],
[ 5.7568e-01, 6.1084e-01, 5.5566e-01, 5.1855e-01, 8.7012e-01],
...,
[ 8.7939e-01, 3.6328e-01, 2.3169e-01, -3.2471e-02, 8.7354e-01],
[ 8.3105e-01, 4.8071e-01, 6.9775e-01, 4.6118e-01, 1.0801e+00],
[ 8.2910e-01, 3.5229e-01, 2.2546e-01, 4.4678e-02, 1.0332e+00]]]],
device='cuda:0', dtype=torch.float16, grad_fn=<StackBackward0>))
Please print the shape, not the tensor itself, and please print it after every line.
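Printing .shape keeps each log entry to one line, whereas printing the tensor itself dumps its values. A minimal sketch of the pattern: the hypothetical log_shape helper prints only a label and the shape, uses x.shape when the object has one (torch tensors do), and falls back to a shape_of function for nested Python lists so the example runs without torch. It returns its argument, so it can be added after every line of a script without changing the data flow.

```python
def shape_of(x):
    """Shape of a rectangular nested list, analogous to tensor.shape."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0] if x else None
    return tuple(shape)


def log_shape(label, x):
    """Print only the label and shape, never the values; return x unchanged."""
    s = x.shape if hasattr(x, "shape") else shape_of(x)
    print(f"{label}: {tuple(s)}")
    return x


# Toy pipeline: build a 2 x 3 x 4 nested list, then take the first element.
x = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
x = log_shape("after build", x)    # after build: (2, 3, 4)
x = log_shape("after x[0]", x[0])  # after x[0]: (3, 4)
```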
log.txt: I have printed them now; I changed the code like this. log.txt is the output file.
Please, please do audio_rep.shape, not audio_rep.
This one?
Could you please change all lines to print the shape rather than the actual tensor?
I have changed the code to _, audio_rep = mdl.transcribe_audio(wav) and print(audio_rep.shape), but is audio_rep.shape the same as with this version?
audio_rep = mdl.transcribe_audio(wav)
print("output audio shape for "+wav)
audio_rep = audio_rep[0]
print(audio_rep.shape)
The shape is correct for a tiny model (and is different from your previous shape).
I assume you changed the model size to tiny. The default training code requires the large-v1 Whisper model in the feature extraction code; with it, the shape should be [500, 1280 (embed_dim), 32 (num_layer)] at the same printing point. It is just a shape issue, and you will need to do some debugging. I cannot provide further support for this.
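Because the layer count and embedding width depend on the Whisper size, the expected shapes change when the extraction model is switched. A sketch of that bookkeeping: the (layers, width) pairs are the encoder dimensions of the released Whisper checkpoints, the [time, embed_dim, num_layer] and [B, num_layer, 25, embed_dim] layouts are the ones quoted in this thread, and the function names are illustrative, not taken from the whisper-at code. Keeping 25 pooled time steps for every size is an assumption, though the tiny test shape printed in this thread also has 25.

```python
# Whisper encoder (num_layers, embed_dim) per model size.
# The .en variants share the architecture of the corresponding size.
WHISPER_ENCODER_DIMS = {
    "tiny": (4, 384),
    "base": (6, 512),
    "small": (12, 768),
    "medium": (24, 1024),
    "medium.en": (24, 1024),
    "large-v1": (32, 1280),
}


def extracted_feature_shape(model_size, num_frames):
    """Per-file layout at the printing point: [time, embed_dim, num_layer]."""
    n_layer, embed_dim = WHISPER_ENCODER_DIMS[model_size]
    return (num_frames, embed_dim, n_layer)


def training_batch_shape(model_size, batch_size, pooled_time=25):
    """Layout seen by the training code: [B, num_layer, pooled_time, embed_dim]."""
    n_layer, embed_dim = WHISPER_ENCODER_DIMS[model_size]
    return (batch_size, n_layer, pooled_time, embed_dim)


print(extracted_feature_shape("large-v1", 500))  # (500, 1280, 32)
print(training_batch_shape("large-v1", 48))      # (48, 32, 25, 1280)
print(training_batch_shape("tiny", 48))          # (48, 4, 25, 384)
```

Whichever size goes into mdl_size_list, the training model has to be configured for the same layer count and width; otherwise reshapes that multiply the batch size by the model's n_layer will not line up with the stored features.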
-Yuan
Yes, thanks for your help, but this problem has troubled me for a long time and I'm hoping to solve it. In extract_as_full_whisper_all.py I changed the model to tiny, so do I also need to change model=whisper-high-lw_tr_1_8 in whisper_at_train/run_as_full_train.sh to tiny? as_full/extract_as_full_whisper_all.py:46:mdl_size_list = ['tiny'] # , 'large-v1', 'medium.en'
2024-01-14 10:52:32.100556
current #epochs=1, #steps=0
test shape
torch.Size([48, 4, 25, 384])
start validation
[]
Traceback (most recent call last):
File "/opt/whisper/whisper-at/src/whisper_at_train/./run.py", line 155, in
File "/opt/whisper/whisper-at/src/whisper_at_train/models.py", line 172, in forward
    audio_rep = audio_rep.reshape(B*self.n_layer, audio_rep.shape[2], audio_rep.shape[3]) # [B*32, 25, 1280]
RuntimeError: shape '[192, 25, 80]' is invalid for input of size 95904000
In the training stage it still fails.
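The numbers in this RuntimeError can be checked by hand, because reshape requires the source and target element counts to match. Assuming B = 48 as in the printed shapes, 192 = 48 * 4, so self.n_layer is 4 (tiny) and the target [192, 25, 80] holds 384,000 elements. The source tensor has shape[2] = 25 and shape[3] = 80 (taken from the target), and 95,904,000 / (25 * 80) = 47,952 = 48 * 999, which is consistent with the [48, 999, 25, 80] shape printed earlier rather than the [48, 4, 25, 384] test shape in this log. A stdlib sketch of that arithmetic, plus a hypothetical guard (check_layer_features, not part of whisper-at) that could run before the reshape to report which dimension disagrees:

```python
import math


def reshape_ok(src_shape, dst_shape):
    """True if a tensor of src_shape can be reshaped to dst_shape (no -1 support)."""
    return math.prod(src_shape) == math.prod(dst_shape)


def check_layer_features(shape, n_layer, embed_dim):
    """Hypothetical guard for [B, n_layer, time, embed_dim] input before a reshape.

    Raises ValueError naming the offending dimensions instead of letting
    reshape fail with only an element count.
    """
    if len(shape) != 4:
        raise ValueError(f"expected 4-D input, got shape {tuple(shape)}")
    _, layers, _, width = shape
    if layers != n_layer or width != embed_dim:
        raise ValueError(
            f"got {layers} layers x {width} dims, model expects "
            f"{n_layer} layers x {embed_dim} dims: features and model size disagree"
        )


src = (48, 999, 25, 80)
dst = (48 * 4, src[2], src[3])  # B*self.n_layer with B=48, n_layer=4 -> (192, 25, 80)
print(math.prod(src), math.prod(dst))  # 95904000 384000
print(reshape_ok(src, dst))            # False

try:
    # Assumed tiny-model dimensions: 4 layers, 384 wide.
    check_layer_features(src, n_layer=4, embed_dim=384)
except ValueError as err:
    print(err)
```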