Question about inference

Hi authors, thanks for your great work. I ran into an issue during inference. I give the model an image and a text prompt and run inference the same way as the demo code you released for mPLUG-Owl2. In my experiment, inference runs successfully. However, I only need the final result, i.e., the output text after decoding, and it seems that each single token generated by the forward method is also output at the same time. I have checked the code you released but could not find the part that controls this. Could you provide some help?
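To make the goal concrete, here is a minimal sketch of the behaviour I am after. It does not use the mPLUG-Owl2 code: `VOCAB`, `toy_next_token`, `decode`, and `generate_final_text` are hypothetical stand-ins for the tokenizer, the forward step, and the generation loop. The point is only that token ids are collected silently and decoded once at the end.

```python
# Toy stand-ins (hypothetical, NOT the mPLUG-Owl2 API).
VOCAB = ["<eos>", "a", "cat", "on", "the", "mat"]
EOS_ID = 0


def toy_next_token(prompt_ids, generated):
    """Pretend forward step: return the id of the next token.

    A real model would condition on prompt_ids and generated;
    this toy just replays a fixed script and ignores the prompt.
    """
    script = [1, 2, 3, 4, 5, EOS_ID]  # "a cat on the mat <eos>"
    return script[len(generated)]


def decode(ids):
    """Turn token ids into text, skipping the end-of-sequence token."""
    return " ".join(VOCAB[i] for i in ids if i != EOS_ID)


def generate_final_text(prompt_ids, max_new_tokens=16):
    """Run the generation loop without emitting anything per step."""
    generated = []
    for _ in range(max_new_tokens):
        next_id = toy_next_token(prompt_ids, generated)
        generated.append(next_id)  # collected only, never printed here
        if next_id == EOS_ID:
            break
    # Decode once, after generation has finished.
    return decode(generated)


final_text = generate_final_text(prompt_ids=[4, 2])
print(final_text)
```

In other words, I would like only `final_text` to be produced, with no per-token output while the loop is running.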

X-PLUG / mPLUG-Owl

Question about inference #239