Open abrichr opened 6 months ago
Related: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
https://github.com/OpenGVLab/InternVL appears to surpass MiniCPM on the benchmark. However, testing suggests it does not perform as well on GUI elements. From https://internvl.opengvlab.com/ (same image as above):
Describe the given image with as much detail as possible.
The image displays a calculator with a gray background and orange buttons. The calculator shows the number 45 in its display. The buttons are arranged in a typical layout with numbers 0 through 9, and function buttons such as addition, subtraction, multiplication, division, and percentage. The calculator also has buttons for clear, memory, and a decimal point. The top left corner of the calculator has three colored dots, which could indicate different modes or functions of the calculator.
Enumerate all interactable GUI elements
The interactable GUI elements in the image include:
- The display showing the number "45".
- The "C" button, which is likely for clearing the current input or calculation.
- The "CE" button, which typically stands for "Clear Entry" and is used to clear the last input or part of the current input.
- The "MC" button, which is likely for clearing the memory or the stored value.
- The "M-" button, which is likely for recalling the last stored value from memory.
- The "M+" button, which is likely for storing the current value in memory.
- The "7", "8", "9" buttons for inputting numbers.
- The "4", "5", "6" buttons for inputting numbers.
- The "1", "2", "3" buttons for inputting numbers.
- The "0" button for inputting numbers.
- The "." button for inputting the decimal point.
- The "+" button for addition.
- The "-" button for subtraction.
- The "x" button for multiplication.
- The "/" button for division.
- The "%" button for calculating the percentage.
- The "=" button to execute the calculation and display the result.

These buttons are designed to be pressed by the user to perform calculations and input numbers on the calculator.
Also worth considering: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
https://github.com/THUDM/GLM-4/blob/main/README_en.md
We have also launched the GLM-4-9B-Chat-1M model that supports 1M context length (about 2 million Chinese characters) and the multimodal model GLM-4V-9B based on GLM-4-9B. GLM-4V-9B possesses dialogue capabilities in both Chinese and English at a high resolution of 1120*1120. In various multimodal evaluations, including comprehensive abilities in Chinese and English, perception & reasoning, text recognition, and chart understanding, GLM-4V-9B demonstrates superior performance compared to GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
Feature request
https://github.com/OpenBMB/MiniCPM is the smallest multimodal model available. The latest version, https://huggingface.co/openbmb/MiniCPM-V-2, appears to be able to understand GUI images:
This model has been integrated into https://github.com/vllm-project/vllm, which is likely how we want to implement it here.
This involves creating an openadapt.adapters.vllm module with a parameterizable model name.
Motivation
Fully offline support.
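
For discussion, here is a rough sketch of what the proposed openadapt.adapters.vllm module could look like. None of this exists yet: `VLLMAdapter`, `make_vllm_backend`, `Backend`, and `DEFAULT_MODEL_NAME` are hypothetical names, and the `prompt(text, image)` signature is an assumption rather than the interface of the existing adapters. The vLLM call uses the offline `LLM.generate` entry point with a `multi_modal_data` input as documented for recent vLLM releases; older versions used a different image input format, and the prompt template (image placeholder tokens) is model-specific, so that part would need checking against whichever version and model are chosen.

```python
"""Hypothetical sketch of openadapt/adapters/vllm.py (not an existing module)."""

from dataclasses import dataclass
from typing import Any, Callable, Optional

# Assumed default; any vision-language model supported by vLLM could be used.
DEFAULT_MODEL_NAME = "openbmb/MiniCPM-V-2"

# A backend takes (prompt, image) and returns the generated text.
Backend = Callable[[str, Any], str]


def make_vllm_backend(model_name: str, max_tokens: int = 512) -> Backend:
    """Build a backend around vLLM's offline inference class.

    vllm is imported lazily so this module can be imported without it.
    The multi_modal_data input format follows recent vLLM releases and
    may differ in the installed version.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name, trust_remote_code=True)
    sampling_params = SamplingParams(max_tokens=max_tokens)

    def generate(prompt: str, image: Any) -> str:
        inputs = {"prompt": prompt}
        if image is not None:
            # Expected to be a PIL image; only attached when provided.
            inputs["multi_modal_data"] = {"image": image}
        outputs = llm.generate(inputs, sampling_params)
        return outputs[0].outputs[0].text

    return generate


@dataclass
class VLLMAdapter:
    """Adapter with a parameterizable model name.

    The backend is injectable so the adapter can be exercised without
    loading a model; by default it is created on first use.
    """

    model_name: str = DEFAULT_MODEL_NAME
    backend: Optional[Backend] = None

    def prompt(self, text: str, image: Any = None) -> str:
        if self.backend is None:
            self.backend = make_vllm_backend(self.model_name)
        return self.backend(text, image)
```

Usage would be along the lines of `VLLMAdapter(model_name="openbmb/MiniCPM-V-2").prompt("Enumerate all interactable GUI elements", image=screenshot)`, where `screenshot` is a PIL image. Keeping the model name as a constructor argument means the other models mentioned in this thread could be swapped in without code changes, provided vLLM supports them.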