Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.

batchnorm1D on padded values results in large activation scaling #81

Closed kevjshih closed 1 year ago

kevjshih commented 1 year ago

https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/76e187a84913aba0e674a1c6b5f69175fb37148a/modules/layers.py#L78

Hi there! I'm not sure if you already handle this separately, but the encoder appears to accept batches of variable-length sequences with filler values. When sequence lengths differ widely, the padded positions drastically reduce the per-channel sample variance, so BatchNorm1d has to apply a large scaling factor to produce zero-mean, unit-variance outputs.
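
For illustration, a rough sketch of the effect (the shapes, lengths, and values here are made up, not taken from the repo):

```python
import torch

# Hypothetical example: 2 sequences, 80 channels, max length 100,
# where the second sequence only has 10 real frames.
x = torch.randn(2, 80, 100)
lengths = torch.tensor([100, 10])
x[1, :, 10:] = 0.0  # padded filler values

# Statistics BatchNorm1d would use: every padded position counts.
naive_var = x.var(dim=(0, 2), unbiased=False)

# Statistics over real frames only.
mask = torch.arange(100)[None, :] < lengths[:, None]   # (B, T)
valid = x.transpose(1, 2)[mask]                         # (num_valid, C)
masked_var = valid.var(dim=0, unbiased=False)

print(naive_var.mean().item(), masked_var.mean().item())
# The naive variance is noticeably smaller than the masked one, so the
# normalization divides real activations by an underestimated std and
# blows them up.
```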

Tomiinek commented 1 year ago

Thank you for reporting the issue! :+1: You are right. Do you have some measurements of the disparities? I do not maintain the code in this repo, and I see only one solution: implementing a custom batch norm layer that supports ignoring the padding, like here, for example.
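
Something along these lines could work (just a rough sketch of the idea, not code from this repo; the `mask` argument and the interface are assumptions):

```python
import torch
import torch.nn as nn


class MaskedBatchNorm1d(nn.BatchNorm1d):
    """BatchNorm1d that computes batch statistics over valid (non-padded)
    time steps only. Hypothetical sketch, not the layer used in this repo."""

    def forward(self, x, mask):
        # x:    (B, C, T) activations
        # mask: (B, 1, T) boolean, True on real frames, False on padding
        if not self.training:
            return super().forward(x) * mask

        n = mask.sum()                                    # number of valid frames
        mean = (x * mask).sum(dim=(0, 2)) / n             # per-channel mean
        var = (((x - mean[None, :, None]) ** 2) * mask).sum(dim=(0, 2)) / n

        with torch.no_grad():
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var

        x = (x - mean[None, :, None]) / torch.sqrt(var[None, :, None] + self.eps)
        if self.affine:
            x = x * self.weight[None, :, None] + self.bias[None, :, None]
        return x * mask  # keep padded positions at zero
```

The encoder would then have to pass a padding mask (built from the sequence lengths) into the layer alongside the activations.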

kevjshih commented 1 year ago

Yep! No worries, just thought I'd bring it up in case you were interested. I can't really share my graphs directly, but we've tested with a masked implementation as you mentioned and found much better convergence rates and validation scores in general, and certainly better training stability, though it does mean you might have to redo the learning-rate search.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.