Captions are very long and verbose

Hi,

I'm running inference on a variety of videos with the luoruipu1/Valley2-7b model, and the resulting captions are always very long and contain a lot of repetitive text. For the two videos in serve/examples, I get the following results:

First, we see a snowmobile driving through a snowy forest with trees in the background. The snowmobile is moving quickly and smoothly through the snow. Next, we see a person riding the snowmobile, enjoying the thrill of the ride. The snowmobile is equipped with tracks in the snow, indicating its path. Then, we see the snowmobile driving through a snowy field with trees in the background. The snowmobile is moving quickly and smoothly through the snow, leaving tracks behind. Finally, we see the snowmobile driving through a snowy field with trees in the background. The snowmobile is moving quickly and smoothly through the snow, leaving tracks behind. Throughout the video, we see the beauty of the snowy landscape and the excitement of the snowmobile ride. The video captures the essence of winter sports and the joy of exploring the snowy wilderness.

First, we see a black and white cat sitting on a toilet in a bathroom. The cat appears to be looking around and observing its surroundings. Next, we see the same cat sitting on the toilet, but this time it seems to be more focused on the toilet itself. The cat is still sitting on the toilet in the following shot, but it appears to be looking down at the floor. Then, we see the cat sitting on the toilet again, but this time it seems to be looking up at the ceiling. In the next shot, the cat is still sitting on the toilet, but it appears to be looking at the camera. Finally, we see the cat sitting on the toilet once more, but this time it seems to be looking down at the floor again. Throughout the video, the cat remains calm and composed, and it does not appear to be startled or disturbed by the presence of the camera.

Is the result supposed to be like this? I was hoping for a more concise caption that explains what is happening in the video in 2-3 sentences. I tried changing the text prompt, but it doesn't seem to make any difference to the result.
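
As a stopgap, the captions could be trimmed after generation. The following is only a minimal sketch of that idea and is not part of Valley: `shorten_caption` and its `max_sentences` parameter are hypothetical names introduced here for illustration. It drops sentences that repeat verbatim (ignoring case and spacing) and keeps the first few that remain. It does not change how the model generates text, so it would not help with sentences that are merely similar.

```python
import re


def shorten_caption(caption: str, max_sentences: int = 3) -> str:
    """Drop repeated sentences, then keep the first `max_sentences`.

    Hypothetical helper for illustration; not part of the Valley codebase.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", caption.strip())

    seen = set()
    unique = []
    for sentence in sentences:
        # Normalize case and spacing so verbatim repeats compare equal.
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence)

    return " ".join(unique[:max_sentences])


if __name__ == "__main__":
    raw = "A cat sits. A cat sits. It looks up. It looks down. The end."
    print(shorten_caption(raw, max_sentences=3))
```

Applied to the snowmobile caption above, this would remove the sentence that appears twice word for word, but the near-duplicates would remain, so a fix on the generation side would still be preferable.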
