According to the file common/consts.py, we know the values of AUDIO_SHAPE, SR, FPS, and FRAMES_PER_SAMPLE. From the first three constants, we can compute
num_frames = AUDIO_SHAPE / SR * FPS = 67267 / 16000 * 15 = 63.06281249999999
which is about one whole frame less than FRAMES_PER_SAMPLE. We encountered this problem when testing the model on a longer audio sequence, for which the misalignment is magnified.
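The arithmetic above can be reproduced with a minimal sketch. The three constant values are taken from the computation in this report; the helper names frames_covered and audio_samples_needed are hypothetical and are not part of the repository. The value of FRAMES_PER_SAMPLE is deliberately not hard-coded here, so it has to be passed in by whoever runs the check.

```python
# Values taken from the computation quoted in this report
# (assumed to match common/consts.py).
AUDIO_SHAPE = 67267  # audio samples per clip
SR = 16000           # audio sample rate in Hz
FPS = 15             # video frames per second


def frames_covered(num_audio_samples, sample_rate=SR, fps=FPS):
    """Number of video frames spanned by the given number of audio samples.

    Hypothetical helper, for illustration only.
    """
    return num_audio_samples / sample_rate * fps


def audio_samples_needed(num_frames, sample_rate=SR, fps=FPS):
    """Number of audio samples that span exactly num_frames video frames.

    Hypothetical helper, for illustration only.
    """
    return num_frames * sample_rate / fps


num_frames = frames_covered(AUDIO_SHAPE)
print(num_frames)  # slightly above 63 frames
```

Calling audio_samples_needed with the FRAMES_PER_SAMPLE value from common/consts.py would give the audio length that matches the video clip exactly, which can then be compared against AUDIO_SHAPE.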