Closed kevjshih closed 1 year ago

Hi there! I'm not sure if you already handle this separately, but since the encoder accepts batches of variable-length sequences padded with filler values, large disparities in sequence length drastically reduce the per-channel sample variance: the long runs of identical padding values drag the variance estimate down. BatchNorm1d then divides by that underestimated standard deviation, effectively applying a large scale factor to the real frames in order to produce zero-mean, unit-variance outputs.

https://github.com/Tomiinek/Multilingual_Text_to_Speech/blob/76e187a84913aba0e674a1c6b5f69175fb37148a/modules/layers.py#L78
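For illustration, here is a minimal, self-contained sketch of the reported effect (the shapes and lengths are made up, and this is not code from the repo). It compares the per-channel variance that BatchNorm1d sees over a zero-padded batch against the variance over the real frames only:

```python
import torch

torch.manual_seed(0)

# Hypothetical batch: 2 sequences, 4 channels, zero-padded to 100 frames,
# with a large disparity in sequence length.
batch, channels, max_len = 2, 4, 100
lengths = torch.tensor([10, 100])
x = torch.randn(batch, channels, max_len)
mask = torch.arange(max_len)[None, :] < lengths[:, None]  # (batch, time), True = real frame
x = x * mask[:, None, :]                                  # zero out the padded positions

# Per-channel variance over all positions (what BatchNorm1d computes in
# training) vs. over the real, non-padded frames only.
var_all = x.var(dim=(0, 2), unbiased=False)
var_real = x.transpose(0, 1)[:, mask].var(dim=1, unbiased=False)
print(var_all / var_real)  # roughly 110/200 here, i.e. badly deflated
```

The ratio shrinks toward zero as the fraction of padding grows, which is the "large scalar" effect described above.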
Thank you for reporting the issue! :+1: You are right. Do you have some measurements of the disparities? I do not maintain the code in this repo, and I see only one solution: implementing a custom batch norm layer that supports ignoring the padding, like here for example.
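For reference, a minimal sketch of what such a masked batch norm layer could look like, written in PyTorch and assuming inputs of shape `(batch, channels, time)` with a boolean mask of shape `(batch, time)`. This is an illustration of the idea, not the implementation the comment refers to:

```python
import torch
import torch.nn as nn

class MaskedBatchNorm1d(nn.Module):
    """BatchNorm1d variant that computes batch statistics over non-padded
    positions only. The mask is True at real (non-padded) frames."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x, mask):
        if self.training:
            m = mask[:, None, :].to(x.dtype)        # (B, 1, T)
            n = m.sum()                             # number of real frames
            # Per-channel mean and variance over real frames only.
            mean = (x * m).sum(dim=(0, 2)) / n
            var = ((x - mean[None, :, None]) ** 2 * m).sum(dim=(0, 2)) / n
            with torch.no_grad():
                self.running_mean.lerp_(mean.detach(), self.momentum)
                self.running_var.lerp_(var.detach(), self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        x = (x - mean[None, :, None]) / torch.sqrt(var[None, :, None] + self.eps)
        return x * self.weight[None, :, None] + self.bias[None, :, None]
```

At inference time it falls back to the running statistics, like `nn.BatchNorm1d`; the mask can be built from sequence lengths with `torch.arange(T)[None, :] < lengths[:, None]`.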
Yep! No worries, just thought I'd bring it up in case you were interested. I can't really share my graphs directly, but we've tested with a masked implementation, as you mentioned, and found much better convergence rates and validation scores in general. Training stability was certainly better too, though it does mean you might have to redo the learning-rate search.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.