TencentGameMate / chinese_speech_pretrain

Chinese speech pretrained models

Problem about time shape #30

Closed. huutuongtu closed this issue 1 year ago.

huutuongtu commented 1 year ago

I have tried this code to extract phonetic features before the last linear layer:

import torch
import torch.nn.functional as F
import soundfile as sf
import librosa
import pandas as pd
import numpy as np
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForPreTraining,
    Wav2Vec2Model,
)
mask_prob=0.0
mask_length=10

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('/home/tuht/PAPL-Attention/pretrained_china')
model = Wav2Vec2ForPreTraining.from_pretrained('/home/tuht/PAPL-Attention/pretrained_china')
model = model.to('cuda')
model = model.eval()
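# as far as I understand, Wav2Vec2ForPreTraining's children are
# [wav2vec2, dropout_features, quantizer, project_hid, project_q],
# so [:-4] keeps only the wav2vec2 encoder (an assumption, not checked for every transformers version)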
newmodel = torch.nn.Sequential(*(list(model.children())[:-4]))
# print(newmodel)
newmodel.to('cuda')
newmodel.eval()

def phonetic_embedding(path):
    wav, sr = librosa.load(path)
    input_values = feature_extractor(wav, return_tensors="pt", sampling_rate=16000).input_values
    input_values = input_values.to('cuda')
    with torch.no_grad():
        outputs = newmodel(input_values)
        last_hidden_state = outputs.last_hidden_state
    x = last_hidden_state.squeeze(0).detach().cpu().numpy()
    return x

print(phonetic_embedding("/home/tuht/mandarin_acoustic/000100126.WAV").shape)

and the output:

(644, 768)

My audio file 000100126.WAV has a duration of 9.36 s and a sample rate of 16000 Hz. As far as I know, one frame is 20 ms, so the expected output shape should be 9.36 / 0.02 - 1 =>

(467, 768)
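For reference, this is the rough calculation behind that expectation (a small sketch, assuming the model's convolutional feature extractor downsamples the 16 kHz waveform by a total factor of 320, i.e. one frame per 20 ms):

duration_s = 9.36
sample_rate = 16000
num_samples = int(duration_s * sample_rate)  # 149760 samples
expected_frames = num_samples // 320         # 468; the conv kernels trim roughly one frame at the edges, giving ~467
print(expected_frames)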

I don't know why there is this difference. Can you explain it? @pengchengguo @LiuShixing Thank you.
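For extra context, here is what I believe is an equivalent way to get the same features, using the encoder submodule model.wav2vec2 directly instead of slicing children() (a minimal sketch under that assumption, not verified to give identical outputs):

def phonetic_embedding_direct(path):
    # same audio loading and feature extraction as in phonetic_embedding above
    wav, sr = librosa.load(path)
    input_values = feature_extractor(wav, return_tensors="pt", sampling_rate=16000).input_values.to('cuda')
    with torch.no_grad():
        # model.wav2vec2 is the underlying Wav2Vec2Model, whose output exposes last_hidden_state
        hidden = model.wav2vec2(input_values).last_hidden_state
    return hidden.squeeze(0).cpu().numpy()

print(phonetic_embedding_direct("/home/tuht/mandarin_acoustic/000100126.WAV").shape)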