deepsound-project / samplernn-pytorch

PyTorch implementation of SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
MIT License
288 stars 75 forks

Explanation of `frame_sizes` and `ns_frame_samples` #17

Open simopal6 opened 6 years ago

simopal6 commented 6 years ago

Hello, can you please explain the purpose of frame_sizes and ns_frame_samples in the SampleRNN constructor?

I get the meaning of frame_sizes from the paper. However, there's something strange (at least to me): in the paper, especially in the main figure, it seems that the frame size at tier 3 is 16 and the frame size at tier 2 is 4. In the code, you use the same values (frame_sizes = [16, 4]), but the order seems to be reversed: in Predictor's forward() you scan the RNNs in reversed order, so apparently you use 4 for tier 3 and 16 for tier 2. Is there something I'm not getting right here...?

Besides, what's the purpose of n_frame_samples for each RNN?

Thanks!

ghost commented 5 years ago

It's been a long time, but if you have found answers for this I'll gladly take them. I'm also confused.

simopal6 commented 5 years ago

I remember getting an intuition of how that worked, but I can't remember exactly what that was. I think that the actual frame size was the product between the two vectors, or something like that... Sorry I can't be of much help :(

BertSam commented 4 years ago

Hello, it's been an even longer time, but I think I'm starting to understand it (I had to read the help text at least 50 times...). The help text says:

    parser.add_argument(
        '--frame_sizes', nargs='+', type=int, required=True,
        help='frame sizes in terms of the number of lower tier frames, \
              starting from the lowest RNN tier'
    )
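For reference, here is a minimal standalone sketch of how this argument gets parsed. Only the `add_argument` call comes from the repository; the bare parser object and the example values `16 4` are assumptions for illustration.

```python
import argparse

# Standalone parser for illustration only; the real training script
# defines many more options than this one.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--frame_sizes', nargs='+', type=int, required=True,
    help='frame sizes in terms of the number of lower tier frames, '
         'starting from the lowest RNN tier'
)

# Parse an explicit argument list instead of reading sys.argv.
args = parser.parse_args(['--frame_sizes', '16', '4'])
print(args.frame_sizes)  # [16, 4]
```

So the values arrive as a plain list of ints, in the order given on the command line, which the help text says is lowest RNN tier first.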

So I think you have to give the frame size of each tier as a multiple of the frame size of the tier below it, not as an absolute number of samples.

Ex: From the paper "HIGH-QUALITY SPEECH CODING WITH SAMPLE RNN" I need the following frame sizes: FS (1) = FS (2) = 2, FS (3) = 16 and FS (4) = 160.

Intuitively I would put as argument: --frame_sizes 2 2 16 160 or --frame_sizes 160 16 2 2

But from what I understand, I would need to pass: --frame_sizes 2 1 8 10

- 2 because it's the lowest tier
- 1 because the lower tier frame yields 2 (2x1=2)
- 8 because the lower tier frame still yields 2 (2x8=16)
- 10 because the lower tier frame now yields 16 (10x16=160)

That would explain the use of ns_frame_samples = map(int, np.cumprod(frame_sizes))
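To check the arithmetic, here is a small sketch under the interpretation above (each value is a multiple of the tier below, lowest tier first). The helper `to_relative` is hypothetical and not part of the repository. Only the `np.cumprod` expression mirrors the quoted line.

```python
import numpy as np


def to_relative(absolute_sizes):
    """Hypothetical helper (not in the repo): convert absolute frame sizes
    in samples, lowest tier first, into per-tier multipliers."""
    relative = []
    previous = 1  # one sample, i.e. the level below the lowest RNN tier
    for size in absolute_sizes:
        if size % previous != 0:
            raise ValueError(f"{size} is not a multiple of {previous}")
        relative.append(size // previous)
        previous = size
    return relative


# Absolute frame sizes from the speech coding example above.
absolute = [2, 2, 16, 160]
frame_sizes = to_relative(absolute)
print(frame_sizes)  # [2, 1, 8, 10]

# Same expression as the quoted line, wrapped in list() so the result
# can be printed and compared on Python 3.
ns_frame_samples = list(map(int, np.cumprod(frame_sizes)))
print(ns_frame_samples)  # [2, 2, 16, 160]
```

Under the same reading, frame_sizes = [16, 4] from the original question would give ns_frame_samples = [16, 64], i.e. 16 samples per frame at the lowest RNN tier and 64 at the tier above it.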

However, I may be misunderstanding it as well. If this is right, I don't know why they would do it that way, because it is really confusing (at least for me).

Hope it helps.

If you have more info please do correct me.

Tks

-bert