Open · zhw123456789 opened this issue 11 months ago
Hi, great work! But when I try to look at the shape of `last_hidden_state`, I run into a problem. The code is the same as in the official documentation:

```python
from datasets import load_dataset
from transformers import AutoProcessor, ClapAudioModel

dataset = load_dataset("ashraq/esc50")
audio_sample = dataset["train"]["audio"][0]["array"]

model = ClapAudioModel.from_pretrained("laion/clap-htsat-fused")
processor = AutoProcessor.from_pretrained("laion/clap-htsat-fused")

inputs = processor(audios=audio_sample, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
```

However, the output shape is `[1, 768, 2, 32]`, which is not compatible with what I've seen in the official documentation. It is expected to be `last_hidden_state` (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`), the sequence of hidden states at the output of the last layer of the model. Am I right, or am I missing some key information?
Hi,
can you let me know how you are running the audio encoder? It seems the `[2, 32]` is `[frequency, time]`. This is because HTSAT treats the audio spectrogram as a 2D image, so the last hidden state keeps spatial dimensions. We take the average over the frequency dimension somewhere before the final output.
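For reference, a minimal sketch of how the `[1, 768, 2, 32]` output can be collapsed into the layouts the documentation describes, assuming the `outputs` from the snippet above. The pooling choices here are illustrative, not necessarily the exact internal reduction the model applies:

```python
# last_hidden_state has layout [batch, hidden, frequency, time] = [1, 768, 2, 32].
last_hidden_state = outputs.last_hidden_state

# One vector per clip: average over the (frequency, time) grid.
pooled = last_hidden_state.mean(dim=(2, 3))              # [1, 768]

# Or flatten the 2x32 grid into a token sequence to get the documented
# (batch_size, sequence_length, hidden_size) layout.
sequence = last_hidden_state.flatten(2).transpose(1, 2)  # [1, 64, 768]

# The model output should also carry an already-pooled vector.
print(outputs.pooler_output.shape)                       # torch.Size([1, 768])
```

If what you ultimately need is the projected CLAP audio embedding (the one comparable with text embeddings), `ClapModel.get_audio_features` or `ClapAudioModelWithProjection` should give you that directly.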