Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
MIT License

data_generation: 30k of your samples are being preprocessed incorrectly #113

Closed bnavard closed 2 months ago

bnavard commented 2 months ago

Hello,

I realized that Monkey/data_generation/amg.py uses each image's basename (without the extension) as the name of the SegmentAnythingModel output JSON file. For example, the JSON file generated for monkey/data_generation/images/scienceqa/images/train/1/image.png is named image.json and stored in the masks folder.

The error comes from line 228 in data_generation/amg.py:

  name=t.split('/')[-1].split('.')[0] # --> the source of this logical bug
  save_base=os.path.join(args.output, name)
  if output_mode == "binary_mask":
      os.makedirs(save_base, exist_ok=False)
      write_masks_to_folder(masks, save_base)
  else:
      save_file = save_base + ".json"
      with open(save_file, "w") as f:
          json.dump(masks, f) # --> this overwrites previously similar json files if name is identical

The problem is that there are multiple files named image.png under images/scienceqa/images/train/, so the amg.py script keeps overwriting image.json in the masks folder. As a consequence, every image that shares a basename with another image is processed incorrectly. In other words, any script built on the output of amg.py is also incorrect, e.g. sam_blip.py.

I counted the duplicate basenames in the image folder. Out of 617052 images there are only 587077 unique basenames. That leaves 29975 (nearly 30k) images whose basename duplicates another image's, so their SAM JSON files are overwritten on top of each other. This is almost 5 percent of your data, for which a wrong ChatGPT long description is generated at the end of your pipeline solely because of this logical bug.
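For anyone who wants to reproduce the count, here is a minimal sketch of how the collisions can be measured. It reuses the same name expression as line 228; the helper names `count_basename_collisions` and `list_images` and the extension list are my own assumptions, not part of the repository.

```python
import os
from collections import Counter


def count_basename_collisions(paths):
    """Return (total, unique, overwritten) under the amg.py naming scheme."""
    # Same expression as line 228 of amg.py: last path component, text before the first dot.
    names = [p.split('/')[-1].split('.')[0] for p in paths]
    counts = Counter(names)
    total = len(names)
    unique = len(counts)
    # Every file beyond the first with a given name overwrites an earlier JSON file.
    return total, unique, total - unique


def list_images(root, exts=(".png", ".jpg", ".jpeg")):
    """Hypothetical helper: yield image paths under root using '/' separators."""
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(exts):
                yield os.path.join(dirpath, filename).replace(os.sep, "/")
```

Calling `count_basename_collisions(list(list_images("images")))` on the image folder should give the three numbers discussed above, assuming the folder layout matches the one used for the count.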

echo840 commented 2 months ago

Hello, we apologize for the inconvenience. When generating the images, we placed them all in one folder with different filenames. You can resolve this by putting all the images in a single folder with distinct filenames, or by modifying the code so that it saves the output to different paths.
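The second option (saving the output to different paths) could look roughly like the sketch below. It derives the output name from the image path relative to the image root instead of from the basename alone. The function `unique_mask_name` and the parameter `input_root` are hypothetical names used for illustration; this is not the repository's code.

```python
import os


def unique_mask_name(image_path, input_root):
    """Hypothetical helper: build an output name from the path relative to input_root."""
    # Path relative to the image root, e.g. scienceqa/images/train/1/image.png
    rel = os.path.relpath(image_path, input_root)
    # Drop only the final extension, keeping the directory components.
    stem, _ = os.path.splitext(rel)
    # Flatten directory separators so the result is a single file name.
    return stem.replace(os.sep, "__").replace("/", "__")
```

With this, images/scienceqa/images/train/1/image.png and images/scienceqa/images/train/2/image.png map to different JSON names when `input_root` is images. In the snippet from the issue, the line that sets `name` would call this helper on `t`, assuming the image root is available at that point. One caveat: flattening with `__` can still collide if directory or file names already contain `__`, so mirroring the directory tree under the output folder is a safer alternative.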