Closed: Blair-Johnson closed this PR 1 year ago
Initial benchmarking indicates that batching enables significantly sub-linear scaling at least up to batch_size=16 on an NVIDIA A100 80GB. Scaling remains sub-linear relative to the batch_size=1 case beyond that point, but the time required for a set of batched audio clips begins to grow linearly with further increases in batch size as the GPU becomes saturated. In this figure, a 214-minute podcast was batched together with itself at batch sizes in {1, 2, 4, 8, 16} and transcribed in parallel. The linear reference assumes linear scaling with respect to the batch_size=1 case and is analogous to running consecutive clips serially.
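To make the linear reference concrete, here is a minimal sketch of how the comparison can be computed. The timing numbers below are illustrative placeholders, not measurements from the A100 run:

```python
def linear_reference(t_batch1: float, batch_size: int) -> float:
    """Projected wall-clock time if batching gave no benefit:
    running `batch_size` copies of the clip serially at the
    batch_size=1 speed."""
    return t_batch1 * batch_size

def speedup_vs_serial(t_batch1: float, t_batched: float, batch_size: int) -> float:
    """How many times faster the batched run is than the serial
    (linear-reference) baseline; > 1 means sub-linear scaling."""
    return linear_reference(t_batch1, batch_size) / t_batched

# Illustrative placeholder timings (seconds) for transcribing the
# whole batch at each batch size -- NOT real A100 measurements.
measured = {1: 100.0, 2: 120.0, 4: 170.0, 8: 300.0, 16: 560.0}

for n, t in measured.items():
    print(f"batch_size={n:2d}: speedup vs serial = "
          f"{speedup_vs_serial(measured[1], t, n):.2f}x")
```

Any speedup above 1.0x corresponds to a point below the linear reference in the figure.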
Opening PR for merging and conflict resolution. The `batch-processing` branch introduces a single major modification to the behavior of the `model.transcribe()` method, which can now accept a list of audio file paths rather than a single audio file path. These files are packed into the batch dimension of the model for transcription, allowing users to achieve better GPU utilization. Audio clips can be of different lengths, and the internal batch size is reduced as the transcription of shorter files completes.

Remaining issues to address: