data_generation - 30k of your samples are being preprocessed incorrectly

Hello,

I noticed that Monkey/data_generation/amg.py uses the basename of each image as the name of the Segment Anything Model (SAM) output JSON file. For example, the JSON file generated for monkey/data_generation/images/scienceqa/images/train/1/image.png is named image.json and stored in the masks folder.

The error comes from line 228 in data_generation/amg.py:

  name=t.split('/')[-1].split('.')[0] # --> source of the bug: only the basename is kept, the directory part is dropped
  save_base=os.path.join(args.output, name)
  if output_mode == "binary_mask":
      os.makedirs(save_base, exist_ok=False)
      write_masks_to_folder(masks, save_base)
  else:
      save_file = save_base + ".json"
      with open(save_file, "w") as f:
          json.dump(masks, f) # --> overwrites any earlier JSON file that has the same name

The problem is that there are multiple files named image.png under images/scienceqa/images/train/, so amg.py keeps overwriting image.json in the masks folder. As a consequence, every image that shares a basename with another image ends up with the wrong SAM output. In other words, any script that builds on the output of amg.py, e.g. sam_blip.py, produces incorrect results for those images.
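One possible fix, given here only as a minimal sketch and not as the maintainers' intended solution, is to derive the output name from the image path relative to the images folder instead of from the basename alone. The helper name unique_mask_name and its image_root argument are hypothetical and do not exist in amg.py.

```python
import os


def unique_mask_name(image_path: str, image_root: str) -> str:
    """Return an output name that keeps the directory part of the image path.

    Hypothetical helper: `image_root` is assumed to be the top-level images folder.
    """
    # Path relative to the images folder, e.g. scienceqa/images/train/1/image.png
    rel = os.path.relpath(image_path, image_root)
    # Drop only the file extension, keep the directories
    stem, _ext = os.path.splitext(rel)
    # Flatten into a single file name so everything can still live in one masks folder
    return stem.replace(os.sep, "__")
```

With this naming, train/1/image.png and train/2/image.png map to different JSON files. Flattening with "__" could still collide if directory names themselves contain "__"; mirroring the directory tree under the masks folder would avoid that. Any downstream script that looks up the JSON files by name would need to apply the same rule.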

I counted the duplicate basenames in the image folder. Out of 617052 images there are only 587077 unique basenames, so 29975 files (nearly 30k) collide with another file of the same basename and their SAM JSON files overwrite one another. That is roughly 4.9 percent of your data, for which a wrong ChatGPT long description is generated at the end of your pipeline solely because of this bug.
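A count like this could be reproduced with a short script along the following lines. This is a sketch under assumptions: it walks every file under the given folder without filtering by extension, and it applies the same split('.')[0] rule that amg.py uses to build the name. The function name count_basename_collisions is made up for illustration.

```python
import os
from collections import Counter


def count_basename_collisions(image_root: str):
    """Return (total files, unique basenames, files lost to name collisions)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(image_root):
        for fname in filenames:
            # Same rule as amg.py: keep everything before the first dot
            counts[fname.split('.')[0]] += 1
    total = sum(counts.values())
    unique = len(counts)
    return total, unique, total - unique
```

Calling count_basename_collisions on the images folder returns the total number of files, the number of distinct output names, and the difference between the two, which is the number of outputs that get overwritten.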

Yuliang-Liu / Monkey, issue #113