Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

MME result mismatch of luodian/OTTER-MPT-7B #328

Closed Li-Qingyun closed 6 months ago

Li-Qingyun commented 6 months ago

The evaluation contains errors. I tried to fix it like this:

class MMEDataset(BaseEvalDataset):
    ......
    def __init__(
        self,
        data_path: str = "Otter-AI/MME",
        *,
        cache_dir: Union[str, None] = None,
        default_output_path: str = "./logs/MME",
        split: str = "test",
        debug: bool = False,
    ):
        super().__init__("MMEDataset", data_path)

        self.default_output_path = default_output_path
        self.cur_datetime = utc_plus_8_time.strftime("%Y-%m-%d_%H-%M-%S")
        self.data = load_dataset(data_path, split=split, cache_dir=cache_dir)
        self.debug = debug

        self.category_data = {}
        # for idx in range(len(self.ids)):

        for item in tqdm(self.data, desc="Loading data"):
            question_id = item["question_id"]  # e.g. 'code_reasoning/0020.png'
            category = item["category"].split("_")[0].lower()
            image_id = question_id.split("/")[-1].replace(".png", "").replace(".jpg", "")
            question = item["question"]
            answer = item["answer"]
            image = item["image"]

            data = {"question": question, "answer": answer, "image": image}

            if category in eval_type_dict["Cognition"]:
                eval_type = "Cognition"
            elif category in eval_type_dict["Perception"]:
                eval_type = "Perception"
            else:
                raise ValueError(f"Unknown category {category} item {item}")

            if eval_type not in self.category_data:
                self.category_data[eval_type] = {}

            if category not in self.category_data[eval_type]:
                self.category_data[eval_type][category] = {}

            if image_id not in self.category_data[eval_type][category]:
                self.category_data[eval_type][category][image_id] = []

            self.category_data[eval_type][category][image_id].append(data)
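For clarity, the loop above builds a nested mapping `category_data[eval_type][category][image_id] -> list of QA dicts`, so the two yes/no questions that share an image end up in the same bucket. A minimal, self-contained sketch of that grouping with made-up items (`eval_type_dict` below is a stand-in for the repo's own mapping, and the sample rows are invented):

```python
# Sketch of the grouping performed in MMEDataset.__init__.
# category_data[eval_type][category][image_id] -> list of question/answer dicts.
eval_type_dict = {
    "Cognition": ["code", "numerical", "text", "commonsense"],
    "Perception": ["artwork", "celebrity", "count", "color", "position",
                   "ocr", "landmark", "scene", "existence", "posters"],
}

# Invented sample rows mirroring the MME schema used above.
items = [
    {"question_id": "code_reasoning/0020.png", "category": "code_reasoning",
     "question": "Is this Python code?", "answer": "Yes"},
    {"question_id": "code_reasoning/0020.png", "category": "code_reasoning",
     "question": "Is this Java code?", "answer": "No"},
]

category_data = {}
for item in items:
    category = item["category"].split("_")[0].lower()
    image_id = item["question_id"].split("/")[-1].rsplit(".", 1)[0]
    eval_type = "Cognition" if category in eval_type_dict["Cognition"] else "Perception"
    # setdefault chains replace the three "if ... not in ..." blocks.
    category_data.setdefault(eval_type, {}) \
                 .setdefault(category, {}) \
                 .setdefault(image_id, []) \
                 .append({"question": item["question"], "answer": item["answer"]})

print(len(category_data["Cognition"]["code"]["0020"]))  # both questions land on image 0020
```

Grouping per image matters because MME's "accuracy+" metric is computed over image-level question pairs, not individual questions.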

The obtained results are as follows, and they do not match the results in the MME leaderboard repo. [Cry]

Is there any misunderstanding?

=========== Cognition ===========
total score: 277.14285714285717
     code score: 47.5
     numerical score: 57.5
     text score: 90.0
     commonsense score: 82.14285714285715
=========== Perception ===========
total score: 1052.6899759903963
     artwork score: 92.0
     celebrity score: 91.47058823529412
     count score: 98.33333333333334
     color score: 106.66666666666666
     position score: 60.0
     ocr score: 50.0
     landmark score: 109.0
     scene score: 161.75
     existence score: 160.0
     posters score: 123.46938775510205
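For context on where these per-category numbers come from (my understanding of the MME protocol, not something defined in this repo): each subtask is scored as accuracy plus "accuracy+", both in percent, so a subtask caps at 200. Plain accuracy counts individual yes/no questions; accuracy+ counts an image as correct only if both of its two questions are answered correctly. A minimal sketch with invented predictions:

```python
# Hedged sketch of MME subtask scoring: score = acc + acc+ (both percentages).
# acc  : fraction of individual questions answered correctly.
# acc+ : fraction of images whose BOTH questions are answered correctly.

def mme_subtask_score(per_image_results):
    """per_image_results: list of (q1_correct, q2_correct) booleans, one pair per image."""
    n_images = len(per_image_results)
    n_correct = sum(int(a) + int(b) for a, b in per_image_results)
    n_pairs_correct = sum(1 for a, b in per_image_results if a and b)
    acc = 100.0 * n_correct / (2 * n_images)
    acc_plus = 100.0 * n_pairs_correct / n_images
    return acc + acc_plus

# Two images: one fully correct, one half correct -> acc 75.0, acc+ 50.0.
print(mme_subtask_score([(True, True), (True, False)]))  # 125.0
```

Under this scheme a score like "scene score: 161.75" decomposes into the two percentages, which is one place a mismatch with the leaderboard can hide if pairs get split across buckets.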

--------------------------------------------------------------------------------
Total Datasets Evaluated: 1