Open · faniuy opened this issue 4 years ago
I'm training on another language and it looks similar for me. My prob perplexity does not converge, neither with the hyperparameters I tried nor with the ones the author used.
@faniuy The lowest value code_perplexity can reach is probably 2, given how it is computed:
result["code_perplexity"] = torch.exp( -torch.sum(hard_probs * torch.log(hard_probs + 1e-7), dim=-1) ).sum()
The minimum perplexity of each code group is 1, but there are 2 groups and the per-group values are summed, so the reported floor is 2.
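A quick way to see why 2.0 is the floor: a self-contained sketch that just mirrors the formula quoted above (the group and vocabulary sizes below are illustrative, not the exact fairseq config):

```python
import torch

# Mirror of the perplexity formula quoted above, for G groups of V codes each.
# A fully collapsed (one-hot) distribution gives exp(0) = 1 per group, so with
# 2 groups the summed "perplexity" bottoms out at 2.
G, V = 2, 320  # illustrative sizes only

def summed_perplexity(probs):
    # probs: (G, V) distribution over codes for each group
    return torch.exp(-torch.sum(probs * torch.log(probs + 1e-7), dim=-1)).sum()

collapsed = torch.zeros(G, V)
collapsed[:, 0] = 1.0                  # every group always picks code 0
uniform = torch.full((G, V), 1.0 / V)  # every code equally likely

print(summed_perplexity(collapsed))  # ~2.0  -> what the logs show when the codes collapse
print(summed_perplexity(uniform))    # ~640  -> G * V, the maximum (all codes used)
```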
@zelabean But it shouldn't be 2.0 just after a few thousand updates. Something's not right. And what puzzles me most is that wav2vec on master consumes half the VRAM compared to wav2vec at tag v0.9.0 with the same settings. That is weird. I don't know what happened yet; there have been too many code changes since 0.9.0.
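For what it's worth, one generic way to compare peak VRAM between the two checkouts is to run the same few updates in each and log the peak allocation (a standalone sketch, not fairseq's own logging):

```python
import torch

def log_peak_vram(tag=""):
    # Peak GPU memory allocated by PyTorch tensors since the last reset.
    torch.cuda.synchronize()
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"{tag} peak allocated: {peak_gib:.2f} GiB")

torch.cuda.reset_peak_memory_stats()
# ... run a handful of identical training updates in each checkout here ...
log_peak_vram("after N updates")
```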
It seems like the projection layers in the gumbel vector quantizer consume all the gradient and become over-parameterized. Setting vq-depth to 1 makes things a little better, since the prob perplexity no longer drops straight to 2.0, but I still can't get the loss down.
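To make the vq-depth point concrete, here is a rough, simplified sketch of a Gumbel quantizer front-end with a configurable projection depth. This is not fairseq's GumbelVectorQuantizer (the class, dimensions, and temperature below are made up for illustration); it only shows where the extra parameters come from when the projection depth is greater than 1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGumbelQuantizer(nn.Module):
    """Simplified stand-in for a Gumbel vector quantizer (not fairseq's implementation).

    `depth` plays the role of vq-depth: the number of layers in the projection
    that maps features to code logits. depth > 1 puts extra trainable parameters
    in front of the softmax.
    """

    def __init__(self, in_dim=512, num_codes=320, depth=1, tau=2.0):
        super().__init__()
        layers = []
        for _ in range(depth - 1):
            layers += [nn.Linear(in_dim, in_dim), nn.ReLU()]
        layers.append(nn.Linear(in_dim, num_codes))
        self.proj = nn.Sequential(*layers)
        self.tau = tau

    def forward(self, x):
        logits = self.proj(x)                                    # (..., num_codes)
        one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return one_hot, logits

q1 = ToyGumbelQuantizer(depth=1)
q2 = ToyGumbelQuantizer(depth=2)
print(sum(p.numel() for p in q1.proj.parameters()))  # single projection layer
print(sum(p.numel() for p in q2.proj.parameters()))  # depth=2 adds a full extra layer
```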
@faniuy Thanks. I'm now training with vq-depth 1, a low learning rate, and no learning rate annealing. If the result is good, I'll let you know.
🐛 Bug
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
Run train.py with the exact same arguments specified in examples/wav2vec/README.md. vq-wav2vec stops converging after a few thousand updates: the loss plateaus around 4.x, and prob_perplexity and code_perplexity get stuck at 2.0. The wav2vec loss also drops significantly slower than at the v0.9.0 tag in the same environment and on the same dataset, while using about half the VRAM compared to the v0.9.0 tag with the same settings.
Code sample
Expected behavior
Loss should converge to 0.x, as the code at the v0.9.0 tag does.
Environment
How you installed fairseq (pip, source): github clone & pip install -e .
Additional context