Hello, I would like to use the very concise RAVE representations for labeling a dataset of small speech chunks for a concatenative synthesizer. To do that I need to know how much of the audio each tensor item labels, and I found that for one second of audio sampled at 44100 Hz the model outputs 22 embeddings. The problem is that when you divide 1 second by 22, Python spits out a very scary-looking 0.045454545454545456. Am I missing something? Is the model stride documented somewhere? I also found it difficult to use the model from pure Python, as I had to dig the code out of a discussion, so I could also be using the wrong code. Thanks!
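To illustrate the arithmetic that confuses me: if the total stride were a power of two such as 2048 samples (just my guess, not something I found in the docs), the 22 frames per second would come from ceiling division with padding rather than an exact split, which would explain the non-round number:

```python
import math

sr = 44100      # sample rate of my audio
stride = 2048   # assumed hop size per latent frame -- NOT confirmed by the docs

# If the last partial window is padded, the frame count is the ceiling:
frames = math.ceil(sr / stride)
print(frames)        # -> 22

# Dividing 1 second by 22 then gives the "scary" average, not the true stride:
print(1 / frames)    # -> 0.045454545454545456
print(sr / frames)   # -> 2004.5... samples, which is why it looks off
```

If that guess is right, each embedding really covers 2048 samples (~46.4 ms), and only the last frame of a chunk is partially padded.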