danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Change how the metaparameter optimization works to use reparameteriza… #78

Closed: danpovey closed this 8 years ago

danpovey commented 8 years ago

…tion as an unconstrained problem, with no barrier function.

danpovey commented 8 years ago

@vince62s, perhaps you could try this on your case where it was optimizing slowly? I think this may turn out to be a more robust solution but it would be nice to have you test it a bit.

I reparameterized it as an unconstrained problem that is related to the actual parameters via sigmoids and multiplication, so that the constraints are naturally satisfied.
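To illustrate what "sigmoids and multiplication" means in practice, here is a minimal sketch of the idea (the four-discount chain and the names are illustrative, not pocolm's actual code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discounts_from_unconstrained(x):
    """Map four unconstrained reals to discounts satisfying
    1 > d1 > d2 > d3 > d4 > 0, using sigmoids and multiplication.
    Hypothetical illustration; not pocolm's exact parameterization."""
    d1 = sigmoid(x[0])       # in (0, 1)
    d2 = d1 * sigmoid(x[1])  # in (0, d1)
    d3 = d2 * sigmoid(x[2])  # in (0, d2)
    d4 = d3 * sigmoid(x[3])  # in (0, d3)
    return d1, d2, d3, d4
```

An unconstrained optimizer (e.g. BFGS) can then search freely over x: every x maps to a feasible point, so no barrier term is needed.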

vince62s commented 8 years ago

Dan, it works much better. The warm-start optimization reaches a very good point and is 20-25% faster. Then the optimization step works fine (it reaches the same convergence as when I used a delta of 5e-6 or 1e-6), but it took 3 steps (each step being the same speed as before) instead of 20 steps. So overall it went much more quickly and reached the expected perplexity.

For reference here:

- out-of-domain corpus: 2.6B words of news
- in-domain corpus: tedlium training data
- test set: dev set from tedlium
- unpruned PPL (4-grams): 159

For comparison, with the Cantab corpus instead of the 2.6B-word news corpus, the PPL is 169.7.

For some reason I do NOT keep the same competitive advantage after pruning, which is not good.

danpovey commented 8 years ago

Thanks! Before I merge this, would you mind pulling again from this branch and re-testing? I realized that the change of variables changes the scale of the gradients, which means the gradient-tolerance values are not equivalent to what they were before. I divided them by 4 and pushed a change just now. I'm concerned that it's possible in principle that the speed gains came from a larger effective gradient tolerance. (But this would only have been the case if it was the actual gradient tolerance that terminated the two optimizations; it should be clear from the logs, which say whether it was that or the 'progress' criterion that caused termination.)
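For the record, the factor of 4 is a back-of-the-envelope consequence of the chain rule, assuming a plain sigmoid mapping $d = \sigma(x)$:

```latex
\frac{\partial F}{\partial x}
  = \frac{\partial F}{\partial d}\,\sigma'(x)
  = \frac{\partial F}{\partial d}\,\sigma(x)\bigl(1 - \sigma(x)\bigr),
\qquad
\sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}.
```

So gradients in the unconstrained space can be up to 4 times smaller than in the original space, and a tolerance on their magnitude has to shrink by the same factor to be comparable.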

Dan


vince62s commented 8 years ago

Same results. Did you see my point on the PPL after pruning? Do you think there is a way to make this more consistent, or could it just be down to the data sources?

vince62s commented 8 years ago

Let me be more specific:

- 2.6B words + Tedlium: unpruned 159; pruned at 40M n-grams: 174
- Cantab + Tedlium: unpruned 165; pruned at 40M n-grams: 167.6

My question relates more to this: when pruning, does it take the dev set into account for better optimization?

danpovey commented 8 years ago

The pruning actually doesn't take the dev set into account at all; it's based on divergence from the model before pruning. If it made any detailed use of the dev set, you'd essentially be training on the dev set. However, the source interpolation weights are based on the dev set, and pruning will naturally remove lower-count states, so in that sense it does make use of the dev set. I actually don't see anything surprising in the results you report. The 2.6B words of news are further from the tedlium data in topic, so they give you a worse model at any particular model size; yet you have a bigger model because there are more words, so the unpruned model is better.
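For readers unfamiliar with divergence-based pruning, here is a rough sketch of that criterion in the spirit of Stolcke's relative-entropy pruning; it is an illustration of the idea, not pocolm's actual implementation:

```python
import math

def prune_cost(p_ngram, p_backoff, p_history):
    """Approximate increase in KL divergence from removing one n-gram.

    p_ngram   -- probability of the n-gram under the unpruned model
    p_backoff -- probability the backoff distribution would assign instead
    p_history -- probability of the history state, weighting its contribution

    All names are hypothetical; a real implementation would also account
    for the renormalization of the backoff weight.
    """
    # Removing the n-gram makes the model fall back to p_backoff, so the
    # divergence contribution is roughly p * log(p / p'):
    return p_history * p_ngram * math.log(p_ngram / p_backoff)

# An n-gram would be pruned when its cost falls below a threshold chosen
# to hit the target model size (e.g. 40M n-grams):
# keep = prune_cost(p, p_bo, p_hist) >= threshold
```

Note that the dev set appears nowhere in this criterion, which is the point: the decision depends only on the unpruned model itself.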

BTW, this issue is a bit orthogonal to the pull request (unless I misunderstand something). It would be great if you could re-run after pulling that change and see if the optimization is still faster. Then I'll be more confident about merging.


vince62s commented 8 years ago

Oh, you didn't see my first comment: "same results". Yes, I did rerun it and it gives me the same results, speed-wise and PPL-wise. You can merge.

vince62s commented 8 years ago

My point was about the interpolation weights: I was thinking that at pruning time there could be some "adjustment" of these weights based on the pruned n-grams. But never mind.

danpovey commented 8 years ago

That doesn't really mesh with how the whole thing is implemented, I'm afraid -- by that time you no longer keep track of the data sources. Otherwise it would be a reasonable thing to try.
