facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0
47.5k stars · 5.62k forks

What is the difference between automatic mask generator and normal predictor? #607

Closed Colezwhy closed 1 year ago

Colezwhy commented 1 year ago

Are there any quality differences? The scenario I observed in my application is that 'Anything Mode' has a much higher time cost than 'Everything Mode' with the same number of points. Why? Can I use 'Everything Mode' with multiple input points to generate their respective masks? E.g. I currently have 100 points for 'Anything Mode'. And what's the difference between these masks?

heyoeyo commented 1 year ago

The automatic mask generator uses the 'normal' predictor internally. It's a bit of a mess to follow, but the (slightly simplified) sequence is:

  1. When you run the automatic mask generator (amg.py) script, it instantiates the SamAutomaticMaskGenerator class, which has a .generate(...) method that is used to create all of the masks. (This is really all the amg.py script does; the rest of the script just handles config + saving the mask results.)
  2. The .generate(...) method internally calls a ._generate_masks(...) method, which itself calls a ._process_crop(...) method...
  3. Inside the ._process_crop(...) method, you can see that the .set_image(...) method is called on an instance of SamPredictor, the same way you'd use the 'normal' predictor.
  4. The points used by the automatic mask generator are separated into batches and passed into a ._process_batch(...) method...
  5. Finally, inside the ._process_batch(...) method, the predictor is used (specifically its .predict_torch(...) method) to generate masks for each of the points passed into the function, where each point is treated as its own individual foreground prompt.

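The flow above can be sketched with minimal stand-in classes (these toy classes are my own, purely for illustration; the real SamPredictor computes a cached image embedding in .set_image(...) and runs the mask decoder in .predict_torch(...)):

```python
# Toy sketch of how SamAutomaticMaskGenerator drives the predictor.
# ToyPredictor stands in for SamPredictor; it returns placeholder
# "masks" instead of running any model.

class ToyPredictor:
    def set_image(self, image):
        # Real SamPredictor: run the image encoder once and cache the embedding.
        self.image = image

    def predict_torch(self, points):
        # Real SamPredictor: decode one mask per point prompt in a batch.
        return [f"mask@{p}" for p in points]


class ToyAutomaticMaskGenerator:
    def __init__(self, predictor, points_per_batch=64):
        self.predictor = predictor
        self.points_per_batch = points_per_batch

    def generate(self, image, point_grid):
        # Mirrors .generate(...) -> ._generate_masks(...) -> ._process_crop(...)
        self.predictor.set_image(image)
        masks = []
        # Mirrors the batching that feeds ._process_batch(...)
        for i in range(0, len(point_grid), self.points_per_batch):
            batch = point_grid[i : i + self.points_per_batch]
            # Each point acts as its own single foreground prompt.
            masks.extend(self.predictor.predict_torch(batch))
        return masks


grid = [(x, y) for x in range(10) for y in range(10)]  # 100-point grid
amg = ToyAutomaticMaskGenerator(ToyPredictor(), points_per_batch=64)
masks = amg.generate("image", grid)
print(len(masks))  # 100: one mask candidate per point (before any filtering)
```

Note this leaves out the real generator's pre-/post-processing (crops, quality filtering, NMS), which is exactly the part that makes its output differ from running the predictor yourself.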
So ultimately the automatic mask generator is just the normal predictor called in a loop (and in batches), but with a bunch of additional pre- and post-processing of the data. If you were to manually run the predictor on the same grid of 100 points, the results would (generally) be different from what you get out of the automatic mask generator, due to the pre-/post-processing steps.

As for why the 'anything' approach ends up being slower than the 'everything' approach (which I assume you mean running the predictor over 100 points in a loop vs. using the automatic mask generator with 100 points?), I'd guess that's due to the use of batching inside the amg code, so there's less time spent communicating between the CPU and GPU. Though there could be other factors, depending on how the non-amg code is implemented.

Colezwhy commented 1 year ago

TY! I will close this issue.