Plachtaa / VALL-E-X

An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io.

Inference: Batch size > 1 #177

Open flexthink opened 6 days ago

flexthink commented 6 days ago

It appears that batch inference is not currently supported: inference fails whenever the batch size is anything other than 1.

In models/vallex.py, inference():

assert y.shape[0] == 1, y.shape

This does not allow audio prompts with a batch size other than 1.
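
A standalone illustration of that guard (the prompt shape below is assumed for the sketch, not taken from the repo):

    import torch

    # Assumed audio-prompt code shape: (batch, frames, n_codebooks).
    y = torch.zeros(2, 150, 8, dtype=torch.long)  # batch of 2 prompts

    # The guard in inference() fires for any batch size other than 1:
    assert y.shape[0] == 1, y.shape  # AssertionError: torch.Size([2, 150, 8])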

Later in the same function:

    xy_pos = torch.concat([x, y_pos], dim=1)

This concatenation assumes x and y_pos have the same size along dim 0 (the batch dimension).
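
For reference, the mismatch is easy to reproduce in isolation. The sequence lengths and model dimension below are made up; only the batch arithmetic matters:

    import torch

    batch, best_of, d_model = 2, 5, 1024

    # x has been tiled for best_of sampling on top of the batch...
    x = torch.randn(batch * best_of, 32, d_model)  # dim 0 is 10
    # ...but y_pos only carries best_of entries.
    y_pos = torch.randn(best_of, 8, d_model)       # dim 0 is 5

    # Concatenating along dim=1 requires all other dims to match:
    xy_pos = torch.concat([x, y_pos], dim=1)
    # RuntimeError: Sizes of tensors must match except in dimension 1.
    # Expected size 10 but got size 5 for tensor number 1 in the list.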

Here is the error you get when the batch size is 2 and best_of is 5:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 10 but got size 5 for tensor number 1 in the list.

Please advise.
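
As a stopgap on my end, I am looping over the batch and calling inference() once per item. A minimal sketch; the argument names (x, x_lens, y) mirror the snippets above and are assumptions, not the exact signature in models/vallex.py:

    def inference_per_item(model, x, x_lens, y, **kwargs):
        # Work around the batch-size-1 restriction by slicing the batch
        # and calling inference() once per item.
        outputs = []
        for i in range(x.shape[0]):
            out = model.inference(x[i:i + 1], x_lens[i:i + 1], y[i:i + 1], **kwargs)
            outputs.append(out)
        # Generated lengths differ per item, so return a list rather than
        # stacking into a single tensor.
        return outputs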