Dongping-Chen / MLLM-Judge

[ICML 2024 Oral] Official code repository for MLLM-as-a-Judge.
https://mllm-judge.github.io/

What is the Lite data in Dataset used for? #4

Open wyclike opened 2 months ago

wyclike commented 2 months ago

I would also like to ask what these fields mean: {"judge_Evaluator": 5, "judge_human": 1, "analysis": 5, "hallucination": 0}

Dongping-Chen commented 2 months ago

Hi, Lite is a smaller subset that we propose to make benchmarking easier. As for the keys in the JSON dict: "judge_Evaluator" is the score from the human annotator, ranging from 1 to 5, which serves as the ground truth when benchmarking an MLLM as a judge; "judge_human" is for the human agreement evaluation, a qualitative experiment, where 1 means agree and 0 means disagree; "analysis" is the human evaluation of the "analysis" phase in the MLLM's "analyze-then-judge" CoT, ranging from 1 to 5, where 1 is worst and 5 is best; "hallucination" is the human evaluation of whether the MLLM's judgment contains a hallucination, where 0 means no and 1 means yes.
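
As a rough illustration of the value ranges described above, the annotation fields of a record could be sanity-checked along these lines. This is only a sketch: `check_annotation` and `ALLOWED` are hypothetical names, not part of the repository, and the key spelling follows the JSON quoted in the question.

```python
# Allowed values for each annotation key, as described in the comment above.
# `ALLOWED` and `check_annotation` are hypothetical, not part of MLLM-Judge.
ALLOWED = {
    "judge_Evaluator": range(1, 6),  # human score, 1 to 5
    "judge_human": (0, 1),           # 1 = agree, 0 = disagree
    "analysis": range(1, 6),         # 1 = worst, 5 = best
    "hallucination": (0, 1),         # 0 = no, 1 = yes
}


def check_annotation(record: dict) -> list:
    """Return the names of keys that are missing or out of range."""
    return [key for key, allowed in ALLOWED.items()
            if record.get(key) not in allowed]


sample = {"judge_Evaluator": 5, "judge_human": 1, "analysis": 5, "hallucination": 0}
print(check_annotation(sample))  # [] (nothing missing or out of range)
```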

wyclike commented 2 months ago

Thanks! But I am still a bit confused. HQ/Score.jsonl has the fields "Human_answer" and "evaluator". You said that "judge_Evaluator" is the score from the human annotator, but isn't "Human_answer" also the score from the human annotator? The two have different values; is this contradictory?

Dongping-Chen commented 2 months ago

I apologize for the confusion regarding the key names in the JSON dictionary. "Human_answer" is the ground truth for the triple {image, instruction, one/two/multiple responses}. "judge_Evaluator" is a key from an older version of the human annotation and should be discarded. We annotated the samples a third time; "Human_answer" is the final result, and "judge_Evaluator" is part of the previous annotations. Thank you for the reminder; I will remove the redundant keys as soon as possible.
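
Until the redundant keys are removed upstream, one way to ignore them locally is to drop them while loading each JSONL line. The sketch below assumes only what is said above (that "judge_Evaluator" is deprecated and "Human_answer" is final); `clean_record` and the example line are made up for illustration, and real records contain more fields.

```python
import json

# Keys from the older annotation round, per the comment above.
DEPRECATED_KEYS = {"judge_Evaluator"}


def clean_record(line: str) -> dict:
    """Parse one JSONL line and drop deprecated annotation keys."""
    record = json.loads(line)
    return {k: v for k, v in record.items() if k not in DEPRECATED_KEYS}


# Made-up example line, not taken from the dataset.
line = '{"Human_answer": 3, "judge_Evaluator": 5, "analysis": 5}'
print(clean_record(line))  # {'Human_answer': 3, 'analysis': 5}
```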

wyclike commented 2 months ago

OK, I got it.

wyclike commented 2 months ago

Hey, I have a problem: I am unable to reproduce the results. When I use gpt-4v on the score task, it always repeats "[[1]]". Here is my command: python api_benchmark.py --model gpt-4v --judge_mode 'score' --temperature 0.4 --top_p 0.2 --image_root /root/MLLM-Judge/Dataset/image/ --api xxxxx

Here are ten data entries from the output file score_gpt-4v.jsonl:
{"score_id": 0, "name": "llava", "answer": "In the image, there is a slice of lime on the tray.", "id": 100, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What fruit is shown?", "image_path": "100.jpg", "original_dataset": "coco", "human": 3, "mllm_judge": "[[1]]"}
{"score_id": 1, "name": "cogvlm", "answer": "The fruit shown in the image is lime. It is identifiable by its green color, the characteristic shape of its leaves, and the small, round green fruit itself.", "id": 100, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What fruit is shown?", "image_path": "100.jpg", "original_dataset": "coco", "human": 2, "mllm_judge": "[[1]]"}
{"score_id": 2, "name": "gpt4", "answer": "The image shows a meal with various items, but it does not appear to contain any fruit. Instead, there is a sandwich which seems to be a Cuban sandwich cut in half, rice that appears to be yellow rice, possibly seasoned with saffron or a similar spice, and a side of black beans. Additionally, there is an item that looks like an ear of corn, which is seasoned and possibly grilled, served with a lime wedge on the side. The corn is the closest item to a fruit here, as botanically speaking, corn kernels are the seeds of a type of grass and considered a grain, and grains are a type of fruit in the botanical sense. However, in culinary terms, corn is generally not referred to as a fruit. Thus, there is no typical culinary fruit depicted in this image.", "id": 100, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What fruit is shown?", "image_path": "100.jpg", "original_dataset": "coco", "human": 4, "mllm_judge": "[[1]]"}
{"score_id": 3, "name": "gemini", "answer": " The image shows a plate with a Cuban sandwich, rice with black beans, corn on the cob, and a lime wedge. The sandwich is made of two halves of a Cuban bread roll with ham, pork, swiss cheese and pickles. The rice is yellow rice. The corn on the cob is topped with a sprinkle of cotija cheese and chili powder. The lime wedge is used to squeeze lime juice over the corn. The image does not show any fruit.", "id": 100, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What fruit is shown?", "image_path": "100.jpg", "original_dataset": "coco", "human": 2, "mllm_judge": "[[1]]"}
{"score_id": 4, "name": "llava", "answer": "The meal in the image is a pizza, which is served in a box. The pizza is placed on a dining table, accompanied by a cup of yogurt and a napkin.", "id": 101, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What is the name of this meal?", "image_path": "101.jpg", "original_dataset": "coco", "human": 3, "mllm_judge": "[[1]]"}
{"score_id": 5, "name": "cogvlm", "answer": "The name of this meal is not explicitly mentioned in the image. However, based on the contents, it appears to be a type of pizza or flatbread topped with cheese and possibly some herbs or vegetables.", "id": 101, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What is the name of this meal?", "image_path": "101.jpg", "original_dataset": "coco", "human": 4, "mllm_judge": "[[1]]"}
{"score_id": 6, "name": "gpt4", "answer": "The image depicts an airplane meal consisting of a main dish that appears to be lasagna, with its characteristic layers and melted cheese on top. The meal is served in an open cardboard box which has a picture of the meal on the lid, making it easier to identify. Accompanying the main dish is a plastic-wrapped cutlery set, indicating that this is a prepared meal likely served to passengers during a flight. On the right side of the tray table is a cup with what seems to be a hot beverage, possibly tea or coffee, covered with a plastic lid and with a logo that might indicate the airline or the beverage brand. The setting suggests that this is an economy class meal given the simplicity of the presentation and packaging. However, due to the angle and quality of the image, it is difficult to provide a definitive identification of the meal's name beyond a reasonable guess that it is lasagna.", "id": 101, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What is the name of this meal?", "image_path": "101.jpg", "original_dataset": "coco", "human": 5, "mllm_judge": "[[1]]"}
{"score_id": 7, "name": "gemini", "answer": " This is a picture of a meal served on a Delta flight. The meal consists of a cheese pizza, a small salad with dressing, and a dessert. The pizza is made with a thin crust and is topped with cheese and mushrooms. The salad is made with lettuce, tomatoes, and cucumbers. The dessert is a chocolate chip cookie.", "id": 101, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What is the name of this meal?", "image_path": "101.jpg", "original_dataset": "coco", "human": 3, "mllm_judge": "[[1]]"}
{"score_id": 8, "name": "llava", "answer": "In the image, the computer is running a Windows operating system. This conclusion can be drawn from the presence of a mouse on the desk, which is a standard peripheral for Windows-based computers. Additionally, there is a keyboard on the desk, which is also commonly used with Windows systems. The presence of these peripherals, along with the desktop computer setup, indicates that the computer is running a Windows operating system.", "id": 102, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What kind of system is the computer running?", "image_path": "102.jpg", "original_dataset": "coco", "human": 2, "mllm_judge": "[[1]]"}
{"score_id": 9, "name": "cogvlm", "answer": "The computer appears to be running a Windows operating system, as evidenced by the recognizable Windows logo on the bottom right corner of the screen.", "id": 102, "instruction": "Please analyse this figure in detail and answer the following question based on this figure: What kind of system is the computer running?", "image_path": "102.jpg", "original_dataset": "coco", "human": 1, "mllm_judge": "[[1]]"}

Have you run into the same problem?

Dongping-Chen commented 2 months ago

Hi, I modified the scripts/api_benchmark.py file to save the raw response of the judge MLLMs instead of only the judgment.

This is my output:

{"id": 0, "pair_id": 0, "image_path": "0.jpg", "original_dataset": "coco", "instruction": "Please analyse this figure in detail and answer the following question based on this figure: Why are the men bending down?", "answer1": {"name": "llava", "answer": "In the image, the men are bending down to pick up a baseball that has been hit and is now on the ground. This is a typical scene in a baseball game, where the batter has just hit the ball and the fielders are trying to catch or pick up the ball to get the batter or other runners out. The presence of a baseball bat, baseball glove, and sports ball in the image further supports the context of a baseball game."}, "answer2": {"name": "cogvlm", "answer": "The men are bending down because they are involved in a baseball game. The batter is preparing to hit the ball thrown by the pitcher, and the catcher is positioned behind the batter to catch the ball if the batter misses or decides not to swing. The umpire is closely observing the pitch to make a call on whether it's a strike or a ball."}, "human_answer": "B", "mllm_judge": "{\n \"Analysis\": \"Assistant A's response suggests that the men are bending down to pick up a baseball that has been hit and is now on the ground. This interpretation is incorrect as the image shows a batter preparing to hit the ball, with the catcher and umpire in their respective positions. Assistant B's response correctly identifies the context of a baseball game, explaining that the batter is preparing to hit the ball, the catcher is positioned to catch the ball, and the umpire is observing the pitch. Assistant B's response accurately describes the roles and actions of the individuals in the image, aligning with the user's instruction to analyze why the men are bending down.\",\n \"Judgement\": \"[[B]]\"\n}"}
{"id": 0, "pair_id": 1, "image_path": "0.jpg", "original_dataset": "coco", "instruction": "Please analyse this figure in detail and answer the following question based on this figure: Why are the men bending down?", "answer1": {"name": "llava", "answer": "In the image, the men are bending down to pick up a baseball that has been hit and is now on the ground. This is a typical scene in a baseball game, where the batter has just hit the ball and the fielders are trying to catch or pick up the ball to get the batter or other runners out. The presence of a baseball bat, baseball glove, and sports ball in the image further supports the context of a baseball game."}, "answer2": {"name": "gpt4", "answer": "The men are bending down as a part of their roles in a baseball game. The man wearing the dark protective gear and squatting is the catcher. It is his role to catch pitches that the batter does not hit and to provide the target for the pitcher. The bending down position allows him to be in a ready stance to react to the pitched ball and to stay low to give the umpire a clear view of the strike zone. The other man in a squatting position near the batter, wearing a face mask and other protective gear, is the umpire. His role includes calling balls and strikes, and he squats to get a close view of the strike zone and better judge the pitches as they come into the catcher's mitt. This image captures them during a live play, where the batter is preparing to attempt a hit on the incoming pitch."}, "human_answer": "B", "mllm_judge": "{\n \"Analysis\": \"Assistant A's response suggests that the men are bending down to pick up a baseball that has been hit and is now on the ground. This explanation is not accurate based on the image, as the men are not in the process of picking up a ball. Instead, they are in their respective positions during a baseball game. Assistant B's response correctly identifies the roles of the men in the image: one is the catcher and the other is the umpire. The catcher is in a squatting position to catch pitches and provide a target for the pitcher, while the umpire is also squatting to get a clear view of the strike zone. This explanation is more accurate and detailed, providing a clear understanding of why the men are bending down in the context of a baseball game.\",\n \"Judgement\": \"[[B]]\"\n}"}
{"id": 0, "pair_id": 2, "image_path": "0.jpg", "original_dataset": "coco", "instruction": "Please analyse this figure in detail and answer the following question based on this figure: Why are the men bending down?", "answer1": {"name": "llava", "answer": "In the image, the men are bending down to pick up a baseball that has been hit and is now on the ground. This is a typical scene in a baseball game, where the batter has just hit the ball and the fielders are trying to catch or pick up the ball to get the batter or other runners out. The presence of a baseball bat, baseball glove, and sports ball in the image further supports the context of a baseball game."}, "answer2": {"name": "gemini", "answer": " The men are bending down because they are playing baseball. The batter is swinging the bat at the ball, and the catcher and umpire are\u8e72\u4e0b\u4ee5\u4fbf\u66f4\u597d\u5730\u770b\u5230\u7403\u3002"}, "human_answer": "A", "mllm_judge": "{\n \"Analysis\": \"Assistant A provides a detailed explanation, stating that the men are bending down to pick up a baseball that has been hit and is now on the ground. This explanation is supported by the context of a baseball game, mentioning the presence of a baseball bat, glove, and sports ball. However, the image does not show a ball on the ground, which makes this explanation inaccurate. Assistant B states that the men are bending down because they are playing baseball, and mentions the batter swinging the bat at the ball, with the catcher and umpire bending down to see the ball better. This explanation is more accurate as it aligns with the image, where the catcher and umpire are indeed bending down in their typical positions during a baseball game.\",\n \"Judgement\": \"[[B]]\"\n}"}

You can merge this update into your local scripts. Once you have the raw output, you may need an additional post-processing step to clean it. Thank you for pointing out the mistakes.
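
A minimal sketch of such a post-processing step, assuming the verdict always appears inside double square brackets as in the outputs shown above ("[[B]]", "[[1]]"). The helper name `extract_verdict` and the regular expression are illustrative and are not taken from the repository's scripts.

```python
import re

# Matches a double-bracketed verdict such as [[B]] or [[3]].
VERDICT = re.compile(r"\[\[([A-Za-z0-9]+)\]\]")


def extract_verdict(raw: str):
    """Return the content of the last [[...]] marker, or None if absent."""
    matches = VERDICT.findall(raw)
    return matches[-1] if matches else None


raw = '{\n "Analysis": "Assistant B is more accurate.",\n "Judgement": "[[B]]"\n}'
print(extract_verdict(raw))                # B
print(extract_verdict("[[1]]"))            # 1
print(extract_verdict("no verdict here"))  # None
```

Taking the last match is a design choice: if the analysis text happens to quote a bracketed label, the final judgment still wins. Records that yield None can be logged and inspected by hand.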

wyclike commented 2 months ago

OK, I got it. I found the problem: I set response=item['answer'], but it should be [item['answer']]. Anyway, thanks!
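
One plausible reason this matters, sketched under the assumption that the benchmark code treats `response` as a list of answers: a bare Python string is itself iterable, so each character would be handled as a separate response. `count_responses` is a hypothetical stand-in, not a function from the repository.

```python
def count_responses(responses):
    """Hypothetical stand-in for code that expects a list of response strings."""
    return len(list(responses))


item = {"answer": "In the image, there is a slice of lime on the tray."}

# Bare string: every character is counted as its own "response".
print(count_responses(item["answer"]))

# Wrapped in a list: exactly one response, as intended.
print(count_responses([item["answer"]]))  # 1
```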

wyclike commented 2 months ago

How do you control the quality of the human ratings? I found some examples where the model's responses are quite good, but the human ratings are very low. For example, here is a response with a human rating of 1: {"id": 3709, "score_id": 4715, "image_path": "3409.jpg", "original_dataset": "ScienceQA", "instruction": "Which country is highlighted?", "true_answers": "New Zealand", "name": "llava", "answer": "The country highlighted on the map is Australia. It is depicted in a different color, typically green, to distinguish it from the surrounding countries and to draw attention to it. The map also shows New Zealand to the southeast of Australia, as well as other countries in the region.", "human": "1"}

And here is a response with a human rating of 5: {"id": 3709, "score_id": 4712, "image_path": "3409.jpg", "original_dataset": "ScienceQA", "instruction": "Which country is highlighted?", "true_answers": "New Zealand", "name": "gpt4", "answer": "The country highlighted in green on the map is Australia. Australia is a large country and continent located in the Southern Hemisphere. It is characterized by its considerable size, distinctive outline, and location surrounded by the Indian and Pacific Oceans. The map shows Australia's mainland along with Tasmania to the south, which is also part of the country. The highlighting makes it stand out from the surrounding blue which represents the ocean and the white areas which may represent other countries not highlighted or areas outside the focal region of the map.", "human": "5"}

wyclike commented 2 months ago

Why is the human answer 'C' when there are only two answers to compare???

{"id": 6, "pair_id": 37, "image_path": "6.jpg", "original_dataset": "coco", "instruction": "Please analyse this figure in detail and answer the following question based on this figure: Is this a good place to relax or to shop?", "answer1": {"name": "cogvlm", "answer": "Based on the image, this appears to be a bustling urban area with numerous signs, advertisements, and people walking around. While it's vibrant and lively, which might be appealing for shopping enthusiasts, the crowded streets and potential noise could make it less ideal for relaxation. Therefore, it might be a good place for shopping but not necessarily for relaxation."}, "answer2": {"name": "gemini", "answer": " This is not a good place to relax. It is a busy street with a lot of traffic and people. There are also a lot of bright lights and signs, which can be distracting and overwhelming. If you are looking for a place to relax, I would recommend finding a quieter place with less traffic and fewer people."}, "human_answer": "C"}

Dongping-Chen commented 2 months ago

Our pair comparison experiment includes ties, which we label as case "C". Based on the probability distribution, you can treat a tie as a win for either side, or simply filter out all the "Tie" samples. For human scoring, we collected three annotations for each sample and selected the option chosen most often. Previous studies have shown that human raters' annotations are noisy. Despite our efforts to control for this, including the training provided to the raters, the noise is difficult to avoid completely.
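
The two ideas above, majority voting over three annotations and filtering out ties, can be sketched as follows. The function names and sample records are illustrative only. The thread does not say how a three-way split between annotators is resolved; this sketch simply returns the label that was seen first in that case.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label; on a full split, the first one seen."""
    return Counter(annotations).most_common(1)[0][0]


def drop_ties(records, tie_label="C"):
    """Keep only pair-comparison records whose human answer is not a tie."""
    return [r for r in records if r["human_answer"] != tie_label]


print(majority_vote(["B", "C", "B"]))  # B

# Made-up minimal records; real ones carry many more fields.
records = [
    {"pair_id": 0, "human_answer": "B"},
    {"pair_id": 37, "human_answer": "C"},
]
print(drop_ties(records))  # [{'pair_id': 0, 'human_answer': 'B'}]
```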