Closed tosho closed 1 month ago
hi tosho, may I know what model you are using? Based on the message, it seems that the model doesn't have vision capabilities, you sure the model is multi-modal?
I tried llama3.1:8b and mistral:latest from Ollama.
edit: I misclicked the comment ;(
hi, tosho, llama3.1 and mistral don't have vision, that's the reason you are seeing that response. You can find a list of models with vision capabilities here: https://ollama.com/search?c=vision I've tried Llava on my Windows 11 and it works fine with images.
So I just tried openbmb-minicpm-llama3-v-2_5 and wasn't getting good results then switched to llava-llama3 which was better in the first image but then when adding a second one it kinda mixes the response with the first image then I add a third image and again it gives previous mixed responses - this is with the two models.
Can this be caused by the settings or it's from the app?
Do you mind post here screenshots of your conversation as well as the model options? You can find model options by clicking the small "settings" icon in the input box
Settings page: Default context length: 100, Default max tokens: 25600
used default settings in the chat first, now I tried to with context_length:11, temp:0.2 else on default but it gives the same mixing results. these are 4 images in one chat. the prompt is : describe the image
first image is an old selfie of me: "In the image, a young man is captured in a close-up shot. His face, adorned with dark hair and a beard, fills most of the frame. He's wearing a white tank top that contrasts with his skin tone. His gaze is directed straight at the camera, creating an engaging connection with the viewer. A slight smile graces his features, adding a touch of warmth to the image. The background is a simple green wall, providing a calm and uncluttered backdrop that allows the focus to remain on the man's face. There are no discernible texts or other objects in the image. The relative position of the man against the wall suggests he is standing quite close to it. The image exudes a sense of casualness and simplicity."
second image: is a movie screenshot of a woman on an empty road surrounded by a forest "In the image, a young man is seen walking on a winding road that cuts through a forest. The road, painted in a dark blue color with white lines marking its path, curves gently to the left, disappearing into the distance. The man, dressed in a white tank top and sporting dark hair, walks towards the camera, his gaze fixed directly at it. The forest surrounding him is dense with trees, their leaves displaying a mix of orange and brown hues, suggesting that the season could be autumn. Above them, the sky is filled with clouds, adding to the serene atmosphere of the scene. There are no discernible texts or other objects in the image. The man's position on the road and his direction of movement suggest he might be embarking on a journey through this tranquil forest."
third image is a photo of a sky with clouds at dawn: "The image captures a serene moment in nature. A young man, dressed in a white tank top and sporting dark hair, is seen walking on a winding road that cuts through a forest. The road, painted in a dark blue color with white lines marking its path, curves gently to the left, disappearing into the distance. The man walks towards the camera, his gaze fixed directly at it, creating an engaging connection with the viewer.
The forest surrounding him is dense with trees, their leaves displaying a mix of orange and brown hues, suggesting that the season could be autumn. Above them, the sky is filled with clouds, adding to the tranquil atmosphere of the scene. The image does not contain any discernible texts or other objects. The man's position on the road and his direction of movement suggest he might be embarking on a journey through this peaceful forest."
the forth image is a photo of a cloud and sky: "The image captures a serene moment in nature. A young man, dressed in a white tank top and sporting dark hair, is seen walking on a winding road that cuts through a forest. The road, painted in a dark blue color with white lines marking its path, curves gently to the left, disappearing into the distance. The man walks towards the camera, his gaze fixed directly at it, creating an engaging connection with the viewer.
The forest surrounding him is dense with trees, their leaves displaying a mix of orange and brown hues, suggesting that the season could be autumn. Above them, the sky is filled with clouds, adding to the tranquil atmosphere of the scene. The image does not contain any discernible texts or other objects. The man's position on the road and his direction of movement suggest he might be embarking on a journey through this peaceful forest.
The image is taken from a low angle, which emphasizes the height of the trees and gives a sense of depth to the scene. The colors in the image are predominantly blue and orange, creating a warm and inviting atmosphere. The man's white tank top stands out against the darker colors of the road and the forest, drawing attention to his presence in the scene.
Overall, the image presents a tranquil scene of a young man walking on a winding road through a forest during what appears to be autumn, given the orange and brown hues of the leaves. The low angle shot adds depth to the image, emphasizing the height of the trees and creating a sense of scale. The colors in the image are predominantly blue and orange, with the white tank top of the man providing a stark contrast."
If I restart the conversations and used only one image per chat it's doing Ok.
P.S. Shouldn't there be default settings per model when you add the model? Can there be "Browse" button besides "Drag&Drop"? Can the left-sidebar have a fixed position?
I think the problem is with context length: with context length = 100, the model will see all previous images when you send a new one. And the model is probably not understanding the prompt well, so instead of "describing the latest image", it is "describing all images".
Try setting the context length to 1 and see if it behaves as you expect.
More about the context length: all models are basically stateless, so for it to "remember" the chat history, we have to send previous messages together with the last message (the one you just typed in the input box).
The Context Length setting defines how many messages are sent to the model:
The reason that we are sending rounds of previous messages is because some models, e.g. Claude, require the chat history to begin with an user message, or it will complain.
Hope this clarifies your question.
And the feature requests:
Thanks for the feedback.
One last (unrelated) thing. I'm not opening another issue. I'm closing it with this one.
There is bug I think with the Settings: When I change something in Settings hit Save then go to either Models, Prompts or Conversations and back to Settings the settings didn't get saved. But it works when I change something in Settings click Save then exit and run the program again.
That's indeed a bug, the Settings data is incorrectly cached. Will be fixed in next release. Thanks for pointing it out
just tried this on Windows 10 Pro x64. Adding image and asking about the image doesn't work. "I can't see the image you're referring to. I'm a large language model, I don't have the capability to view images or access external data. However, I can try to help you if you describe it to me!
If you'd like to provide more context about the image, such as where you saw it or what it's supposed to be (e.g., a photograph, illustration, etc.), I may be able to make an educated guess about its content and assist you further." " I cannot see images directly, but I can help you describe or analyze an image if you tell me what it is about. Please provide a brief description of img-0 so I can assist you better." I tried with 2 LLMs.