Open cassiaaaaaa opened 4 months ago
Text generation is not involved in training.
Therefore, beam search depends only on inference.
For the demo, we set top_p and temperature for diverse text generation.
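To make the distinction concrete, here is a minimal, self-contained sketch of the two decoding strategies over a toy next-token table (the table, vocabulary, and all function names are illustrative, not from MoAI; a real model would produce the log-probs with a forward pass). Beam search is deterministic and keeps the top-scoring partial sequences; top-p (nucleus) sampling is stochastic and trades determinism for diversity:

```python
import math
import random

# Toy next-token log-prob table keyed by the previous token.
# (Illustrative only; a real LM computes these with a forward pass.)
VOCAB = ["a", "b", "c", "</s>"]
LOGPROBS = {
    None: {"a": math.log(0.50), "b": math.log(0.30), "c": math.log(0.15), "</s>": math.log(0.05)},
    "a":  {"a": math.log(0.10), "b": math.log(0.60), "c": math.log(0.20), "</s>": math.log(0.10)},
    "b":  {"a": math.log(0.20), "b": math.log(0.10), "c": math.log(0.30), "</s>": math.log(0.40)},
    "c":  {"a": math.log(0.25), "b": math.log(0.25), "c": math.log(0.10), "</s>": math.log(0.40)},
}

def beam_search(num_beams=3, max_len=4):
    """Deterministic: keep the num_beams highest-scoring partial sequences."""
    beams = [([], 0.0)]  # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == "</s>":
                candidates.append((toks, score))  # finished beam carries over
                continue
            prev = toks[-1] if toks else None
            for tok, lp in LOGPROBS[prev].items():
                candidates.append((toks + [tok], score + lp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:num_beams]
    return beams[0][0]

def top_p_sample(p=0.95, temperature=0.9, max_len=4, seed=0):
    """Stochastic: sample from the smallest token set whose mass reaches p."""
    rng = random.Random(seed)
    toks = []
    for _ in range(max_len):
        prev = toks[-1] if toks else None
        # Temperature-scale, then renormalize.
        scaled = {t: math.exp(lp / temperature) for t, lp in LOGPROBS[prev].items()}
        z = sum(scaled.values())
        ranked = sorted(scaled.items(), key=lambda x: x[1], reverse=True)
        kept, mass = [], 0.0
        for t, pr in ranked:
            kept.append((t, pr / z))
            mass += pr / z
            if mass >= p:
                break  # nucleus found
        total = sum(pr for _, pr in kept)
        r = rng.random() * total
        for t, pr in kept:
            r -= pr
            if r <= 0:
                toks.append(t)
                break
        if toks and toks[-1] == "</s>":
            break
    return toks
```

Since neither function touches gradients or a loss, the choice between them is purely an inference-time setting; the paper's num_beams=3 and the demo's top_p=0.95 are just two different decoding configurations for the same trained model.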
We didn't use a specific initializer for MoE.
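For readers unfamiliar with the choice being asked about, here is a small sketch contrasting "no specific initializer" (each expert gets an independent default random init) with the common alternative of cloning a pretrained FFN into every expert. All names, shapes, and the fan-in scaling are assumptions for illustration, not MoAI's actual code:

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

def default_init(d_in, d_out):
    # He/Kaiming-style fan-in scaling, a typical default for linear layers.
    return rng.normal(0.0, (2.0 / d_in) ** 0.5, size=(d_in, d_out))

def make_experts(num_experts=6, d_model=16, d_ff=32, from_ffn=None):
    """Build expert FFN weight sets.

    from_ffn=None: each expert is independently default-initialized,
    which is what "no specific initializer" implies.
    from_ffn=<weights>: every expert starts as a copy of a pretrained
    FFN, a common alternative in the MoE literature.
    """
    experts = []
    for _ in range(num_experts):
        if from_ffn is None:
            experts.append({"w1": default_init(d_model, d_ff),
                            "w2": default_init(d_ff, d_model)})
        else:
            experts.append(copy.deepcopy(from_ffn))
    return experts
```

With independent default init the six experts start from different random weights, so the router has an easier time differentiating them early in training; with cloning, all experts start identical and diverge only through routing noise and gradient updates.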
Dear author, I saw your new work Meteor; it's awesome! But I still have some questions about MoAI. Sorry to bother you again.
The first one: in the paper, you mention using beam search with num_beams=3 for generation, but in the demo you use top_p=0.95. Is beam search with 3 beams used during training? Using beam search in training seems uncommon.
The second one: what type of initialization is used for the MoE (the six experts) in the second training step?