bshall / hubert

HuBERT content encoders for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
https://bshall.github.io/soft-vc/
MIT License

Confusion about pads and alignment #11

Closed splinter21 closed 7 months ago

splinter21 commented 11 months ago

Great work! I'm a bit confused about the padding at https://github.com/bshall/hubert/blob/main/hubert/model.py#L81

After padding, the output has the same number of frames as a spec computed with the same sample rate and hop size as HuBERT. However, the output of the fairseq HuBERT is one frame shorter than that of the soft-vc HuBERT.

For example, with a 16 kHz sample rate and a hop size of 320, the number of frames along the time axis is:

spec: 250
soft_hubert: 250
fairseq_hubert: 249

When using fairseq_hubert, I usually trim the tail of the spec to match the HuBERT output. With soft_hubert, the padding seems to make that trimming unnecessary. I'm not sure which approach gives better alignment: padding the input waveform to HuBERT, or trimming the spec. Do you have any suggestions?
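The frame counts above can be reproduced with a little arithmetic. The sketch below is only an illustration and rests on several assumptions that are not stated in this thread: that the feature extractor uses the usual HuBERT-base convolution stack (kernels 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2), that the padding at the linked line adds half of (receptive field - hop) to each side of the waveform, and that the clip is 80000 samples long (5 s at 16 kHz, a hypothetical length chosen to match the 250-frame example). The helper names are made up for this example.

```python
# Assumed HuBERT-base feature extractor: (kernel, stride) for each conv layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=CONV_LAYERS):
    """Output length of a stack of unpadded 1-D convolutions."""
    n = num_samples
    for kernel, stride in layers:
        n = (n - kernel) // stride + 1
    return n


def receptive_field_and_hop(layers=CONV_LAYERS):
    """Total receptive field and total stride (hop), both in samples."""
    field, hop = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop


field, hop = receptive_field_and_hop()  # 400 and 320 for the assumed stack
pad = (field - hop) // 2                # assumed padding per side: 40 samples

num_samples = 80000                     # hypothetical 5 s clip at 16 kHz
spec_frames = num_samples // hop        # assumes one spec frame per hop
unpadded_frames = num_frames(num_samples)          # no padding (249 here)
padded_frames = num_frames(num_samples + 2 * pad)  # padded input (250 here)

print(spec_frames, unpadded_frames, padded_frames)
```

Under these assumptions the unpadded stack gives 249 frames and the padded one gives 250, which matches the numbers reported above. The missing frame comes from the receptive field (400 samples) being wider than the hop (320 samples): without padding, the last 80 samples are too few to fill one more window.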