Open spagliarini opened 3 years ago
Ah, good question. The last layer of the generator has dimension 16384 simply because it is a power of four; each of the five layers of the generator increases the number of timesteps by a factor of four, starting from 16 (arbitrary choice).
This output length could represent any amount of time depending on sampling rate, but 16kHz is a common sampling rate in speech processing and conveniently works out to around one second of generated audio (our goal), so we went with that.
Hi!
I would like to clean up a doubt I have. Is the fact that the last layer of the generator has dimension 16384 related to CNN constraints or is it important that the dimension is a bit more than one would like to obtain? Here, 1 s.
Thank you in advance!!!