Open hungchiayu1 opened 1 week ago
@hungchiayu1 The audio waveform is randomly cropped to a maximum of 10 s (at a sample rate of 48000), so the same file can produce different audio embeddings on each call. The differences also grow with the length and musical variation of the track.
This might also explain why the differences increase when using enableFusion=false (default), as seen in #90.
Workaround: You can avoid this by splitting your audio file into 10 s chunks and passing them to the model as a batch. You can then average the per-chunk embeddings to get a single track embedding.
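A minimal sketch of that workaround, in case it helps. It assumes a mono waveform already resampled to 48 kHz; `chunk_waveform`, `track_embedding` and the `embed_fn` argument are hypothetical names for illustration, not part of the library. Zero-padding the last chunk is also my assumption; dropping a short tail may work just as well.

```python
import numpy as np

SAMPLE_RATE = 48_000     # sample rate mentioned in this thread
CHUNK_SAMPLES = 480_000  # 10 s at 48 kHz, the current max input length


def chunk_waveform(waveform, chunk_samples=CHUNK_SAMPLES):
    """Split a 1-D waveform into fixed-size chunks.

    The last chunk is zero-padded to full length (an assumption, see above).
    Returns an array of shape (n_chunks, chunk_samples).
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    n_chunks = max(1, int(np.ceil(len(waveform) / chunk_samples)))
    padded = np.zeros(n_chunks * chunk_samples, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_chunks, chunk_samples)


def track_embedding(waveform, embed_fn, chunk_samples=CHUNK_SAMPLES):
    """Average per-chunk embeddings into one track-level embedding.

    embed_fn is a hypothetical callable that maps a batch of shape
    (n_chunks, chunk_samples) to embeddings of shape (n_chunks, dim).
    """
    chunks = chunk_waveform(waveform, chunk_samples)
    embeddings = np.asarray(embed_fn(chunks))
    return embeddings.mean(axis=0)
```

With the real model, `embed_fn` would be a thin wrapper around `get_audio_embedding_from_data`, assuming it accepts a batch of that shape. Since no chunk is longer than 10 s, the random crop should no longer pick a different window each time, so repeated calls should give the same track embedding.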
@lukewys @RetroCirce The max length is currently set to 480000 samples (10 s) for the audio encoder inputs. Does the audio encoder only support up to 10 seconds, or can the max length be changed?
Is this behaviour expected? When I call get_audio_embedding_from_data multiple times on the same audio and take the cosine similarity between the resulting embeddings, the similarity varies significantly.