acids-ircam / RAVE

Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder

finding out the stride of the model #243

Closed albluc24 closed 9 months ago

albluc24 commented 1 year ago

Hello, I would like to use RAVE's very compact representations to label a dataset of small speech chunks for a concatenative synthesizer. To do that, I need to know how much audio each item of the latent tensor covers. I found that for one second of audio sampled at 44100 Hz the model outputs 22 embeddings. The problem is that when you divide 1 second by 22, Python spits out a very scary-looking 0.045454545454545456. Am I missing something? Is the model stride available somewhere in the docs? I also had difficulty using the model from pure Python, since I had to dig the code out of a discussion, so I could also be using the wrong code. Thanks!

domkirke commented 9 months ago

22 embeddings seems weird; isn't it rather the cropped latent size you are looking at? Given the overall model architecture, the stride should be a power of 2.

caillonantoine commented 9 months ago

The model has a stride of 2048 :)

domkirke commented 9 months ago

(for a v2 config)
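
For anyone landing here with the same labeling problem, the arithmetic from the answers above can be sketched in a few lines of Python. This is only a minimal sketch: it takes the stride of 2048 quoted for a v2 config and the 44100 Hz sample rate from the question, and it assumes (not confirmed in this thread) that the encoder pads the input up to a multiple of the stride, which would explain why one second yields 22 frames rather than 21. The function names are hypothetical helpers, not part of the RAVE API.

```python
import math

SAMPLE_RATE = 44100  # Hz, as in the original question
STRIDE = 2048        # samples per latent frame, quoted above for a v2 config


def frame_duration(stride=STRIDE, sample_rate=SAMPLE_RATE):
    """Nominal duration in seconds covered by one latent frame."""
    return stride / sample_rate


def num_frames(num_samples, stride=STRIDE):
    """Expected number of latent frames for a signal of num_samples.

    Assumption: the input is padded up to a multiple of the stride,
    so the count is rounded up.
    """
    return math.ceil(num_samples / stride)


def frame_span(index, stride=STRIDE, sample_rate=SAMPLE_RATE):
    """Nominal (start, end) time in seconds of latent frame `index`."""
    start = index * stride / sample_rate
    return start, start + stride / sample_rate


if __name__ == "__main__":
    print(frame_duration())         # roughly 0.0464 s per frame
    print(num_frames(SAMPLE_RATE))  # 22 frames for one second of audio
    print(frame_span(1))            # nominal time span of the second frame
```

So the step between embeddings is 2048 / 44100 s (about 46.4 ms) rather than 1 / 22 s. Note that these spans are nominal hop positions only; the receptive field of the encoder is wider than one stride, so each embedding is also influenced by neighboring audio.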