Open roedoejet opened 10 months ago
Each batch is B * T * K
Where B is batch size and is set to 16 by default
T is time in frames. To calculate the number of frames in a Mel spectrogram in one second, we can do 1 / (fft_hop_frames / input_sampling_rate).
K is the number of Mel bins, by default it's 80
So if you set fft_hop_frames really low without changing the sampling rate, you'll potentially get memory errors.
Worst case:
T = max_wav_length / (fft_hop_frames / input_sampling_rate)
batch_size * T * n_mels
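For reference, here is a rough Python sketch of the worst-case arithmetic above. The function names are made up for illustration and are not part of EveryVoice. It assumes max_wav_length is given in seconds, and the example values (22050 Hz sampling rate, hop of 256, 11 seconds) are assumptions rather than confirmed defaults.

```python
import math


def frames_per_second(fft_hop_frames: int, input_sampling_rate: int) -> float:
    """Mel spectrogram frames per second: 1 / (fft_hop_frames / input_sampling_rate)."""
    return 1 / (fft_hop_frames / input_sampling_rate)


def worst_case_batch_elements(
    batch_size: int,
    max_wav_length: float,  # assumed to be in seconds
    fft_hop_frames: int,
    input_sampling_rate: int,
    n_mels: int,
) -> int:
    """Worst case number of values in one batch: batch_size * T * n_mels."""
    # T = max_wav_length / (fft_hop_frames / input_sampling_rate), rounded up
    # to a whole number of frames.
    t = math.ceil(max_wav_length / (fft_hop_frames / input_sampling_rate))
    return batch_size * t * n_mels


if __name__ == "__main__":
    # Hypothetical settings, only to show how the hop size changes the result.
    for hop in (256, 64, 16):
        n = worst_case_batch_elements(16, 11.0, hop, 22050, 80)
        print(f"fft_hop_frames={hop}: {n} values per batch in the worst case")
```

Lowering fft_hop_frames by a factor of 4 makes T, and so the whole batch, roughly 4 times larger, which is where the memory errors would come from.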
@roedoejet Is the worst case missing * n_mels? Looking at https://github.com/roedoejet/EveryVoice/issues/150#issuecomment-1804492605, the formula is B * T * K, where K is n_mels, which is missing from the worst case's equation.
Worst case: batch_size * max_wav_length / (fft_hop_frames / input_sampling_rate)
Should the worst case be: batch_size * max_wav_length / (fft_hop_frames / input_sampling_rate) * n_mels?
yes! sorry - I shouldn't have written that while talking in the meeting! good catch! I'll edit the comment in case we come across it again
something that checks that this number isn’t too big:
n_mels * 1/(fft_hop_frames / input_sampling_rate)
maybe also in relation to max_audio_length
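A possible shape for such a check, as a minimal sketch only: the function names and both limits below are hypothetical placeholders, not existing EveryVoice settings, and max_audio_length is assumed to be in seconds.

```python
from typing import Optional


def mel_values_per_second(n_mels: int, fft_hop_frames: int, input_sampling_rate: int) -> float:
    """n_mels * 1 / (fft_hop_frames / input_sampling_rate)."""
    if fft_hop_frames <= 0 or input_sampling_rate <= 0:
        raise ValueError("fft_hop_frames and input_sampling_rate must be positive")
    return n_mels * 1 / (fft_hop_frames / input_sampling_rate)


def check_spectrogram_size(
    n_mels: int,
    fft_hop_frames: int,
    input_sampling_rate: int,
    max_audio_length: Optional[float] = None,  # assumed to be in seconds
    per_second_limit: float = 50_000,  # hypothetical limit, tune for your hardware
    per_utterance_limit: float = 1_000_000,  # hypothetical limit, tune for your hardware
) -> float:
    """Raise ValueError if the configuration produces too many Mel values."""
    per_second = mel_values_per_second(n_mels, fft_hop_frames, input_sampling_rate)
    if per_second > per_second_limit:
        raise ValueError(
            f"{per_second:.0f} Mel values per second exceeds {per_second_limit}; "
            "consider raising fft_hop_frames"
        )
    # Optionally also check the size of the longest allowed utterance.
    if max_audio_length is not None and per_second * max_audio_length > per_utterance_limit:
        raise ValueError(
            f"{per_second * max_audio_length:.0f} Mel values for the longest "
            f"utterance exceeds {per_utterance_limit}"
        )
    return per_second
```

Raising an error at configuration time would surface the problem before training starts, rather than as an out-of-memory error partway through a batch.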