Closed Li-Qingyun closed 6 months ago
The evaluation contains errors. I tried to fix them like this:
```python
class MMEDataset(BaseEvalDataset):
    ......

    def __init__(
        self,
        data_path: str = "Otter-AI/MME",
        *,
        cache_dir: Union[str, None] = None,
        default_output_path: str = "./logs/MME",
        split: str = "test",
        debug: bool = False,
    ):
        super().__init__("MMEDataset", data_path)
        self.default_output_path = default_output_path
        self.cur_datetime = utc_plus_8_time.strftime("%Y-%m-%d_%H-%M-%S")
        self.data = load_dataset(data_path, split=split, cache_dir=cache_dir)
        self.debug = debug
        self.category_data = {}

        # for idx in range(len(self.ids)):
        for item in tqdm(self.data, desc="Loading data"):
            question_id = item["question_id"]  # e.g. 'code_reasoning/0020.png'
            category = item["category"].split("_")[0].lower()
            image_id = question_id.split("/")[-1].replace(".png", "").replace(".jpg", "")
            question = item["question"]
            answer = item["answer"]
            image = item["image"]
            data = {"question": question, "answer": answer, "image": image}

            if category in eval_type_dict["Cognition"]:
                eval_type = "Cognition"
            elif category in eval_type_dict["Perception"]:
                eval_type = "Perception"
            else:
                raise ValueError(f"Unknown category {category} item {item}")

            if eval_type not in self.category_data:
                self.category_data[eval_type] = {}
            if category not in self.category_data[eval_type]:
                self.category_data[eval_type][category] = {}
            if image_id not in self.category_data[eval_type][category]:
                self.category_data[eval_type][category][image_id] = []
            self.category_data[eval_type][category][image_id].append(data)
```
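For reference, the grouping logic above (type → category → image → question list) can be sketched in isolation with `collections.defaultdict`, using made-up items and an assumed `eval_type_dict` shape (the real one comes from the MME eval code):

```python
from collections import defaultdict

# Hypothetical items mimicking the MME question format (two questions per image).
items = [
    {"question_id": "code_reasoning/0020.png", "category": "code_reasoning",
     "question": "Is this python code?", "answer": "Yes"},
    {"question_id": "code_reasoning/0020.png", "category": "code_reasoning",
     "question": "Is this java code?", "answer": "No"},
]

# Assumed mapping; the actual eval_type_dict is defined in the MME evaluation code.
eval_type_dict = {
    "Cognition": ["code", "numerical", "text", "commonsense"],
    "Perception": ["artwork", "color", "count"],
}

# Nested defaultdicts replace the three "if ... not in ..." initializations.
category_data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for item in items:
    category = item["category"].split("_")[0].lower()   # 'code_reasoning' -> 'code'
    image_id = item["question_id"].split("/")[-1].rsplit(".", 1)[0]  # -> '0020'
    eval_type = "Cognition" if category in eval_type_dict["Cognition"] else "Perception"
    category_data[eval_type][category][image_id].append(
        {"question": item["question"], "answer": item["answer"]}
    )

# Both questions end up under the same image, which is what the MME
# "accuracy+" metric (both questions per image correct) relies on.
print(len(category_data["Cognition"]["code"]["0020"]))
```

This is only a sketch of the grouping, not the full dataset class; it drops the image payload and the `load_dataset` call.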
The results I obtained are as follows, and they do not match the results in the MME leaderboard repo. [Cry]
Is there any misunderstanding?
```
=========== Cognition ===========
total score: 277.14285714285717

code score: 47.5
numerical score: 57.5
text score: 90.0
commonsense score: 82.14285714285715

=========== Perception ===========
total score: 1052.6899759903963

artwork score: 92.0
celebrity score: 91.47058823529412
count score: 98.33333333333334
color score: 106.66666666666666
position score: 60.0
ocr score: 50.0
landmark score: 109.0
scene score: 161.75
existence score: 160.0
posters score: 123.46938775510205
--------------------------------------------------------------------------------
Total Datasets Evaluated: 1
```