LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

dimension of last_hidden_state #135

Open zhw123456789 opened 11 months ago

zhw123456789 commented 11 months ago

Hi, great work! But when I try to look at the shape of `last_hidden_state`, I run into a problem. The code is the same as in the official documentation:

```python
from datasets import load_dataset
from transformers import AutoProcessor, ClapAudioModel

dataset = load_dataset("ashraq/esc50")
audio_sample = dataset["train"]["audio"][0]["array"]

model = ClapAudioModel.from_pretrained("laion/clap-htsat-fused")
processor = AutoProcessor.from_pretrained("laion/clap-htsat-fused")

inputs = processor(audios=audio_sample, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
```

However, the output shape is `[1, 768, 2, 32]`, which is not compatible with what I've seen in the official documentation:

> last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

Am I right, or am I missing some key information?

lukewys commented 7 months ago

Hi,

Can you let me know how you are running the audio encoder? The trailing `[2, 32]` is `[frequency, time]`. This is because HTSAT treats the audio spectrogram as a 2D image, so the last hidden state keeps a 2D spatial layout; the average over the frequency dimension is taken somewhere before the final output.
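
For reference, here is a minimal sketch of how you could collapse that map into one embedding per clip, assuming the `[batch, hidden, freq, time]` layout above. The mean pooling below is illustrative, not necessarily identical to what the model does internally:

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, ClapAudioModel

dataset = load_dataset("ashraq/esc50")
audio_sample = dataset["train"]["audio"][0]["array"]

model = ClapAudioModel.from_pretrained("laion/clap-htsat-fused")
processor = AutoProcessor.from_pretrained("laion/clap-htsat-fused")

inputs = processor(audios=audio_sample, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state   # [1, 768, 2, 32] = [batch, hidden, freq, time]

# Illustrative pooling: average over the frequency and time axes
# to get one 768-dim embedding per clip.
pooled = hidden.mean(dim=(2, 3))     # [1, 768]
print(pooled.shape)

# ClapAudioModel also returns a ready-made pooled output:
print(outputs.pooler_output.shape)   # [1, 768]
```

If you just need audio embeddings for retrieval, `ClapModel.get_audio_features(**inputs)` in transformers returns a projected, pooled embedding directly, so you don't have to pool the hidden states yourself.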