We currently generate a sentence by taking the argmax prediction at each timestep. However, greedily choosing the argmax at each timestep does not always yield the highest-probability sentence. In fact, Karpathy and Fei-Fei (2015) report that a beam of seven can increase CIDEr from 0.61 to 0.66.
We should implement a beam search decoder, where the size of the beam is a free parameter in the model.
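
As a starting point for discussion, here is a minimal sketch of what such a decoder could look like. It is not tied to our model code: `next_log_probs` is a hypothetical callback that returns log-probabilities over the vocabulary for a given token prefix, and `toy_model`, the token ids, and the default values are made up for illustration. The toy distribution is chosen so that the greedy (beam width 1) result is not the most probable sentence.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[List[int], float]  # (token ids, summed log-probability)


def beam_search(
    next_log_probs: Callable[[Sequence[int]], Sequence[float]],
    bos: int,
    eos: int,
    beam_width: int = 3,
    max_len: int = 20,
) -> List[Hypothesis]:
    """Return hypotheses sorted from most to least probable.

    `next_log_probs` is an assumed interface, not an existing function:
    it maps a token prefix to log-probabilities over the vocabulary.
    """
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for tok, lp in enumerate(next_log_probs(tokens)):
                if lp == float("-inf"):
                    continue  # skip zero-probability tokens
                candidates.append((tokens + [tok], score + lp))
        # Keep only the `beam_width` best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Hypotheses that hit max_len without emitting eos are kept as well.
    finished.extend(beams)
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(prefix: Sequence[int]) -> List[float]:
    """Hypothetical 4-token model: 0 = BOS, 1 = EOS, 2 and 3 are words."""
    table = {
        (0,): [0.0, 0.0, 0.6, 0.4],
        (0, 2): [0.0, 0.55, 0.0, 0.45],
        (0, 3): [0.0, 0.9, 0.1, 0.0],
    }
    probs = table.get(tuple(prefix), [0.0, 1.0, 0.0, 0.0])
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


greedy = beam_search(toy_model, bos=0, eos=1, beam_width=1)
wide = beam_search(toy_model, bos=0, eos=1, beam_width=2)
print(greedy[0][0])  # [0, 2, 1]
print(wide[0][0])    # [0, 3, 1]
```

With a beam width of 1 the search reduces to the current argmax decoding and commits to token 2 at the first step, ending with a sentence of probability 0.6 × 0.55. A beam width of 2 keeps the second-best first token alive and finds the sentence with probability 0.4 × 0.9, which is higher. This sketch compares raw summed log-probabilities; whether to add length normalisation is a separate design choice that is left open here.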