acids-ircam / RAVE

Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder

finding out the stride of the model #243

Closed albluc24 closed 11 months ago

albluc24 commented 1 year ago

Hello, I would like to use the very concise RAVE representations to label a dataset of small speech chunks for a concatenative synthesizer. To do that, I need to know how much audio each item of the latent tensor covers. I found that for one second of audio sampled at 44100 Hz, the model outputs 22 embeddings. The problem is that when you divide 1 second by 22, Python spits out a very scary-looking 0.045454545454545456. Am I missing something? Is the model stride available somewhere in the docs? I also had difficulty using the model from pure Python, as I had to find the code in a discussion, so I could also be using the wrong code. Thanks!

domkirke commented 11 months ago

22 embeddings seems weird; isn't it rather the cropped latent size you are looking at? Given the overall model architecture, the stride should be a power of 2.

caillonantoine commented 11 months ago

The model has a stride of 2048 :)

domkirke commented 11 months ago

(for a v2 config)
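
For reference, the arithmetic implied by a stride of 2048 at 44100 Hz can be sketched as below. This is only a minimal illustration: the function names are hypothetical and not part of the RAVE API, and rounding the frame count up (as if the input were padded to a multiple of the stride) is an assumption that happens to match the 22 embeddings reported above, not something confirmed from the RAVE source.

```python
import math

STRIDE = 2048         # samples per latent frame (v2 config, per the thread)
SAMPLE_RATE = 44100   # Hz, as in the original question


def latent_frame_duration(stride: int = STRIDE, sample_rate: int = SAMPLE_RATE) -> float:
    """Seconds of audio covered by one latent frame."""
    return stride / sample_rate


def latent_frame_count(num_samples: int, stride: int = STRIDE) -> int:
    """Number of latent frames for num_samples of audio.

    Assumption: the count is rounded up, as if the signal were padded
    to a whole number of strides.
    """
    return math.ceil(num_samples / stride)


def frame_time_span(index: int, stride: int = STRIDE, sample_rate: int = SAMPLE_RATE):
    """(start, end) in seconds of the audio covered by latent frame `index`."""
    start = index * stride / sample_rate
    end = (index + 1) * stride / sample_rate
    return start, end


if __name__ == "__main__":
    print(latent_frame_duration())    # roughly 0.0464 s per frame
    print(latent_frame_count(44100))  # frames for one second of audio
    print(frame_time_span(3))         # time span labeled by the fourth frame
```

Under these assumptions each latent frame covers 2048 / 44100 seconds (about 46.4 ms), so one second of audio does not divide evenly into frames: 44100 / 2048 is about 21.53, which rounds up to 22. That would explain why 1 / 22 does not give the true frame duration.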