Audio embedding differs significantly each time

Is this behaviour expected? When I call get_audio_embedding_from_data multiple times on the same audio and take the cosine similarity between the resulting embeddings, the similarity varies significantly from call to call.

import torch
import librosa
import laion_clap
import numpy as np

cos_sim = torch.nn.CosineSimilarity()

clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt() # download the default pretrained checkpoint.

def int16_to_float32(x):
    return (x / 32767.0).astype(np.float32)

def float32_to_int16(x):
    x = np.clip(x, a_min=-1., a_max=1.)
    return (x * 32767.).astype(np.int16)

audio_data, _ = librosa.load('test0.wav', sr=48000) # sample rate should be 48000
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_data = torch.from_numpy(int16_to_float32(float32_to_int16(audio_data))).float() # quantize before sending it to the model
for _ in range(5):
    audio_embed1 = clap.get_audio_embedding_from_data(x = audio_data, use_tensor=True)
    audio_embed2 = clap.get_audio_embedding_from_data(x = audio_data, use_tensor=True)
    print(cos_sim(audio_embed1,audio_embed2))
### Outputs
tensor([0.9915], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.2983], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.5371], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.3576], device='cuda:0', grad_fn=<SumBackward1>)
tensor([0.9719], device='cuda:0', grad_fn=<SumBackward1>)

@hungchiayu1 the audio waveform is randomly cropped to a maximum of 10 s (480,000 samples at a sample rate of 48000), so each call can embed a different segment of the same file and return a different embedding. The differences also increase with the length and musical variation of the track.

This might also explain why the differences increase when using enable_fusion=False (the default), as seen in #90.

see: https://github.com/LAION-AI/CLAP/blob/8e558817d853808486768004fa1b61ac9d69f2a2/src/laion_clap/training/data.py#L465-L468
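To illustrate the effect, here is a minimal sketch of random truncation. It is not the library's code: the helper name `random_crop` and the explicit `rng` argument are assumptions made for this example. It only shows why two calls on the same long waveform can see different segments.

```python
import numpy as np

MAX_LEN = 480_000  # 10 s at 48 kHz, the limit mentioned in this thread


def random_crop(waveform, max_len=MAX_LEN, rng=None):
    """Return a random contiguous window of at most max_len samples.

    Hypothetical helper for illustration only, not part of laion_clap.
    """
    rng = np.random.default_rng() if rng is None else rng
    overflow = len(waveform) - max_len
    if overflow <= 0:
        # Short inputs are returned unchanged (no cropping needed).
        return waveform
    # Pick a start index uniformly from [0, overflow].
    start = int(rng.integers(0, overflow + 1))
    return waveform[start:start + max_len]
```

With an unseeded generator, repeated calls on a 3-minute track will usually return different 10 s windows, which is consistent with the varying cosine similarities reported above.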

Workaround: You can avoid the random crop by splitting your audio file into 10 s chunks and passing them to the model as a batch. You can then average the per-chunk embeddings to get a single track embedding.
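A rough sketch of this workaround, under some assumptions: the helper names `chunk_audio` and `track_embedding` are made up for this example, the last chunk is zero-padded (the library may pad short inputs differently), and the embedding step is passed in as a callable `embed_fn` so the chunking logic can be checked without loading the model.

```python
import numpy as np

SAMPLE_RATE = 48000
CHUNK_SAMPLES = 10 * SAMPLE_RATE  # 480,000 samples = 10 s


def chunk_audio(waveform, chunk_samples=CHUNK_SAMPLES):
    """Split a 1-D waveform into an (N, chunk_samples) batch.

    The final chunk is zero-padded; this padding choice is an assumption.
    """
    waveform = np.asarray(waveform, dtype=np.float32).reshape(-1)
    n_chunks = max(1, int(np.ceil(len(waveform) / chunk_samples)))
    padded = np.zeros(n_chunks * chunk_samples, dtype=np.float32)
    padded[:len(waveform)] = waveform
    return padded.reshape(n_chunks, chunk_samples)


def track_embedding(waveform, embed_fn, chunk_samples=CHUNK_SAMPLES):
    """Average the per-chunk embeddings returned by embed_fn.

    embed_fn takes an (N, T) array and returns an (N, D) array.
    """
    chunks = chunk_audio(waveform, chunk_samples)
    embeddings = np.asarray(embed_fn(chunks))
    return embeddings.mean(axis=0)
```

With the model from the original post, usage could look like `track_embedding(audio, lambda c: clap.get_audio_embedding_from_data(x=c, use_tensor=False))`, where `audio` is the 1-D array returned by `librosa.load(..., sr=48000)`. Note that a mostly silent final chunk counts as much as a full one in the plain mean, so you may prefer to drop or down-weight it.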

@lukewys @RetroCirce the max length is currently set to 480 000 samples (10s) for the audio encoder inputs. Does the audio encoder only support up to 10 seconds or can the max length be changed?

LAION-AI / CLAP

Audio embedding differs significantly each time #166