danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models
90 stars 48 forks

Multiple dev sets. #76

Open francisr opened 7 years ago

francisr commented 7 years ago

What should I do if I have multiple dev sets of unequal sizes, but I want them to contribute equally to the optimisation? If one set is 10 times bigger, I don't want it to be 10 times more important than the other ones.

danpovey commented 7 years ago

You could just repeat the more-important dev data.
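A minimal sketch of what "repeating the dev data" could look like in practice (the helper names and the word-count-based target are illustrative, not part of pocolm): upsample each dev set by an integer factor so that all sets contribute roughly equal word counts, then concatenate them into a single dev file.

```python
def word_count(lines):
    """Total number of words across a list of text lines."""
    return sum(len(line.split()) for line in lines)

def equalize(dev_sets):
    """Repeat each dev set (a list of lines) so that every set's total
    word count roughly matches the largest set's count.
    Returns the concatenated, upsampled lines."""
    counts = [word_count(lines) for lines in dev_sets]
    target = max(counts)
    combined = []
    for lines, count in zip(dev_sets, counts):
        repeats = round(target / count)  # integer number of copies
        combined.extend(lines * repeats)
    return combined
```

Writing the combined lines out as one dev file then gives each original set approximately equal influence on the metaparameter optimisation, at the cost of the dev file growing with the number and imbalance of the sets.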


francisr commented 7 years ago

Not the most convenient solution when I have many sources with various amounts of data, but it'll probably do.
What is the impact of the dev set on the final model? That is, for the same training text, how much do you expect perplexities to vary with different dev sets?

danpovey commented 7 years ago

The dev set will definitely affect the interpolation weights of the different data sources, which will make a difference in some applications. And of course the perplexity on the dev set itself will be highly dependent on the nature of the dev set (word length, domain, etc.)
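To see why the dev set drives the interpolation weights, here is a toy sketch (not pocolm's actual estimation code) of the standard EM update for linear-interpolation weights: each source's weight is the average posterior responsibility it takes for the dev-set words, so weights follow whichever source best matches the dev set's domain.

```python
def em_weights(source_probs, dev_words, iters=50):
    """Estimate linear-interpolation weights over several unigram
    'sources' by EM, maximizing likelihood of the dev words.

    source_probs: list of dicts mapping word -> probability.
    Returns a list of weights summing to 1."""
    k = len(source_probs)
    w = [1.0 / k] * k
    for _ in range(iters):
        totals = [0.0] * k
        for word in dev_words:
            # E-step: posterior responsibility of each source for this word
            # (tiny floor stands in for smoothing of unseen words).
            p = [w[i] * source_probs[i].get(word, 1e-10) for i in range(k)]
            z = sum(p)
            for i in range(k):
                totals[i] += p[i] / z
        # M-step: weights are average responsibilities.
        w = [t / len(dev_words) for t in totals]
    return w
```

For example, with a "news" source and a "chat" source, a dev set of news-domain words pushes nearly all the weight onto the news source; swap in a chat-domain dev set and the weights flip. Pocolm's interpolation is estimated differently (and per n-gram history), but the dependence on the dev set's domain is the same in spirit.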


francisr commented 7 years ago

What if there is only one data source?


danpovey commented 7 years ago

If there is only one data source, I doubt very much that the dev set would make any difference.
