IpsumDominum / Pytorch-Simple-Transformer

A simple transformer implementation without difficult syntax and extra bells and whistles.

add beam search #4

Open shouldsee opened 2 years ago

shouldsee commented 2 years ago

Without a decoding method one cannot actually use the trained network to translate... Greedy decoding requires a very good network.
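
For reference, a minimal sketch of the greedy decoding being discussed, assuming a hypothetical encoder-decoder interface (`model.encode` / `model.decode`, plus `bos_id` / `eos_id` token ids); these names are illustrative, not this repo's actual API:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    # Hypothetical interface: model.encode(src) -> memory,
    # model.decode(tgt, memory) -> logits of shape (1, len(tgt), vocab).
    memory = model.encode(src)
    tgt = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model.decode(tgt, memory)
        next_token = logits[0, -1].argmax().item()  # always take the top word
        tgt = torch.cat([tgt, torch.tensor([[next_token]])], dim=1)
        if next_token == eos_id:
            break
    return tgt.squeeze(0)
```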

shouldsee commented 2 years ago

Decoding is indeed an interesting process... I tried both MCMC and beam search at training time, but they don't work well with a sequence-independent decoding scheme... Training-time decoding is really a Baum-Welch process, and this is a really fun topic to explore.

IpsumDominum commented 2 years ago

Alright I'll add Beam search some time soon. :)

Indeed, it is interesting to explore how decoding schemes can affect training. Have you tried using the label smoothing loss as shown here: http://nlp.seas.harvard.edu/2018/04/03/attention.html? This should help with beam search, I think.
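
If it helps, here is a rough sketch of that label-smoothing loss in PyTorch, modelled on the recipe in that post. It assumes the model outputs log-softmax probabilities and that there is a padding token index; the class name and signature are illustrative, not part of this repo:

```python
import torch
import torch.nn as nn

class LabelSmoothing(nn.Module):
    """KL-divergence loss against a smoothed target distribution."""
    def __init__(self, vocab_size, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.vocab_size = vocab_size
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (batch, vocab) log-softmax outputs; target: (batch,) word ids
        true_dist = torch.full_like(log_probs, self.smoothing / (self.vocab_size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0          # never assign mass to padding
        true_dist[target == self.padding_idx] = 0   # ignore padded positions entirely
        return self.criterion(log_probs, true_dist)
```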

I'm not very familiar with the Bayesian perspective, so I can't comment much on MCMC and so forth :) But I'd be interested to hear more if you wish to elaborate on what you did, your findings, and your thoughts.

shouldsee commented 2 years ago

Thanks. I was pursuing a rather convoluted path trying to fit a completely generative model, without any luck (of course it won't work, because it's missing the E-step of the EM algorithm). Currently I am studying sequence-level training under an HMM framework, so that the decoder is trained in its test-time form during training. Sequence-level training will be somewhat different from the simple predict-the-next-word training scheme, but I am expecting quite a bit of fun.

Beam search is a heuristic to approximate the maximum-likelihood emission sequence. To be exact, one should really use Viterbi decoding, but it's simply too computationally expensive... That's why I'd prefer to keep the decoding in the latent variable rather than in the word space; this way I only need to work with a 64-dim vector instead of a 3069-dim one-hot vector... But it is still unclear how to decode in a continuous multi-dimensional vector space... In any case, neural nets are funny beings that predict next-word distributions incredibly well, and I will be trying to extend the probabilistic model beyond the softmax function.
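
For concreteness, a minimal sketch of word-space beam search under the same hypothetical `model.encode` / `model.decode` interface as the greedy sketch above (again, not this repo's actual API): each beam keeps a partial sequence and its cumulative log-probability, expands the top-k continuations, and prunes back to the beam width.

```python
import torch
import torch.nn.functional as F

def beam_search(model, src, bos_id, eos_id, beam_width=4, max_len=50):
    memory = model.encode(src)
    beams = [(torch.tensor([[bos_id]]), 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == eos_id:
                candidates.append((seq, score))  # finished beam is carried over
                continue
            log_probs = F.log_softmax(model.decode(seq, memory)[0, -1], dim=-1)
            topk_lp, topk_ids = log_probs.topk(beam_width)
            for lp, idx in zip(topk_lp.tolist(), topk_ids.tolist()):
                new_seq = torch.cat([seq, torch.tensor([[idx]])], dim=1)
                candidates.append((new_seq, score + lp))
        # keep only the best `beam_width` partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[0, -1].item() == eos_id for seq, _ in beams):
            break
    return beams[0][0].squeeze(0)  # highest-scoring sequence
```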