Segment Anything

official repo: https://github.com/facebookresearch/segment-anything

Introduction

GPT-based NLP models have recently succeeded because they resolved the following three questions:

What task will enable zero-shot generalization?
What is the corresponding model architecture?
What data can power this task and model?
Task
- Foundation models in NLP use prompting to enable zero-shot learning
- The paper proposes a promptable segmentation task, which returns a valid segmentation mask for any given prompt
- A prompt specifies what to segment in an image and can include spatial or text information
Model
- A model for the promptable segmentation task must satisfy the following conditions
- It must support flexible prompts, compute masks in real time for interactive use, and be ambiguity-aware
- SAM separates the prompt encoder from the image encoder and combines their outputs in a lightweight mask decoder
- For ambiguity awareness, SAM predicts multiple masks for a single prompt, so it handles ambiguity naturally (e.g., shirt vs. person)
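
The benefit of separating the two encoders can be illustrated with a minimal sketch. Everything here (`ToySegmenter`, `encode_image`, `decode`, the string "embeddings") is a hypothetical stand-in and not the SAM API; the choice of three candidates per prompt is also an assumption for illustration. The only point is that the expensive image embedding is computed once and reused for every prompt, while the cheap decoder runs per prompt and returns several candidate masks.

```python
from dataclasses import dataclass, field


@dataclass
class ToySegmenter:
    """Hypothetical encoder/decoder split; not the real SAM API."""
    encoder_calls: int = 0
    _cache: dict = field(default_factory=dict)

    def encode_image(self, image_id: str) -> str:
        # Expensive step: run once per image, then serve from the cache.
        if image_id not in self._cache:
            self.encoder_calls += 1
            self._cache[image_id] = f"embedding({image_id})"
        return self._cache[image_id]

    def decode(self, image_id: str, prompt: tuple) -> list:
        # Cheap step: combine the cached embedding with one prompt.
        embedding = self.encode_image(image_id)
        # Several candidates per prompt (3 is an assumption here), so an
        # ambiguous click can map to e.g. "shirt" or "person".
        return [f"mask[{embedding}, {prompt}, candidate={k}]" for k in range(3)]


model = ToySegmenter()
for click in [(10, 20), (30, 40), (50, 60)]:
    candidates = model.decode("photo.jpg", click)

# The image encoder ran a single time even though three prompts were decoded.
print(model.encoder_calls, len(candidates))
```

In an interactive tool this is what keeps each click responsive: only the lightweight decoding step sits on the critical path.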
Data
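- Existing datasets are not sufficient to reach the desired level
- The authors designed a data engine to build the dataset; it is split into assisted-manual, semi-automatic, and fully automatic stages
- assisted-manual: the familiar workflow of auto-labeling followed by human post-processing
- semi-automatic: SAM produces masks for a subset of objects when prompted with their likely locations; annotators annotate the objects that were left unlabeled
- fully automatic: SAM is prompted with a regular grid of foreground points
- The result is SA-1B
- Over 1B masks on 11M licensed, privacy-preserving images

  zero-shot? Zero-shot learning means learning to classify classes that were never observed during training; it is a branch of meta-learning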

Segment Anything Task

Task

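Inspired by NLP, the paper defines a promptable segmentation task in which the prompt can be foreground/background points, a rough box, free-form text, and so on. Given such a prompt, the model returns a valid mask. Here "valid" means that even when the prompt is ambiguous and could refer to multiple objects, the output should be a reasonable mask for at least one of those objects.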
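
One way to read this notion of "valid" is sketched below. The `Candidate` type, the rectangles, the score values, and the 0.5 IoU cutoff are all made up for illustration; the sketch only shows that a returned mask counts as valid when it reasonably matches at least one of the objects the prompt could plausibly refer to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # for readability only; SAM masks carry no class label
    pixels: frozenset  # (x, y) coordinates covered by the mask
    score: float       # hypothetical confidence attached to the mask


def rect(x0, y0, x1, y1):
    """Pixels of an axis-aligned rectangle, end-exclusive."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_valid(mask, plausible_objects, min_iou=0.5):
    """A mask is 'valid' if it reasonably matches at least one plausible object."""
    return any(iou(mask, obj) >= min_iou for obj in plausible_objects)


# A click on a shirt is ambiguous: it could mean the shirt or the whole person.
person = rect(0, 0, 8, 16)
shirt = rect(2, 4, 6, 10)
plausible = [shirt, person]

candidates = [
    Candidate("shirt", shirt, 0.9),
    Candidate("person", person, 0.8),
    Candidate("background blob", rect(20, 20, 24, 24), 0.1),
]
best = max(candidates, key=lambda c: c.score)
print(best.name, is_valid(best.pixels, plausible))
```

Both the shirt mask and the person mask pass the check, while the unrelated blob does not, which matches the idea that ambiguity is acceptable as long as the answer fits one reasonable interpretation.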

Pre-training

The promptable segmentation task was chosen because, as NLP has already shown, it gives a natural pre-training method and a general approach to zero-shot transfer. Segmenting via prompts resembles conventional interactive segmentation; the key difference is that the model must return a valid mask for any prompt, even when the prompt is ambiguous.

Segment Anything Model

SAM uses an MAE pre-trained ViT as the image encoder. The prompt encoder accepts points, boxes, text, and dense mask inputs. Points and boxes are represented by positional encodings, text is encoded with the text encoder from CLIP, and dense mask prompts are embedded with convolutions and combined with the image embedding. The mask decoder maps the outputs of the two encoders to a mask.

See the paper for the references for each module.

Segment Anything Data Engine

As noted above, existing data was insufficient, so the authors designed a new data engine and used it to build a dataset of 1.1B masks. The engine consists of three stages: assisted-manual -> semi-automatic -> fully automatic.

Zero-Shot Transfer Experiments

This section introduces some of the zero-shot transfer experiments performed with SAM.

D.2. Zero-Shot Edge Detection

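SAM is prompted with a 16x16 grid of points, which yields 768 predicted masks (3 per point). NMS then removes redundant masks, and a Sobel filter is applied to the remaining masks. Finally, a pixel-wise max over the filtered maps produces the result shown in the figure above.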
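
The steps above can be sketched in plain Python on toy data. This is a simplified illustration under stated assumptions, not the paper's implementation: the masks are small hand-made probability maps instead of SAM outputs, NMS is done greedily on mask IoU with an illustrative 0.7 threshold, and all function names are hypothetical.

```python
def point_grid(n):
    """Return n*n (x, y) prompt points evenly spaced over the unit square."""
    coords = [(i + 0.5) / n for i in range(n)]
    return [(x, y) for y in coords for x in coords]


def mask_iou(a, b, thresh=0.5):
    """IoU of two probability maps after binarizing at `thresh`."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            in_a, in_b = pa > thresh, pb > thresh
            inter += in_a and in_b
            union += in_a or in_b
    return inter / union if union else 0.0


def nms(masks, scores, iou_thresh=0.7):
    """Greedy NMS: drop a mask if it overlaps a higher-scoring kept mask."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def edge_map(masks, scores):
    """NMS, then Sobel per kept mask, then a pixel-wise max over the results."""
    edges = [sobel_magnitude(masks[i]) for i in nms(masks, scores)]
    h, w = len(masks[0]), len(masks[0][0])
    return [[max(e[y][x] for e in edges) for x in range(w)] for y in range(h)]


# 16x16 grid -> 256 prompts; with 3 masks per prompt that is 768 masks.
num_masks = len(point_grid(16)) * 3

# Toy stand-ins for predicted masks: left half, a duplicate of it, top half.
size = 8
left = [[1.0 if x < 4 else 0.0 for x in range(size)] for y in range(size)]
top = [[1.0 if y < 4 else 0.0 for x in range(size)] for y in range(size)]
masks = [left, [row[:] for row in left], top]
scores = [0.9, 0.8, 0.7]

kept = nms(masks, scores)        # the duplicate mask is suppressed
edges = edge_map(masks, scores)  # non-zero only along the mask boundaries
```

The pixel-wise max means a boundary found by any surviving mask shows up in the final map, so overlapping masks contribute their edges without averaging each other out.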

D.4. Zero-Shot Instance Segmentation

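On the COCO and LVIS v1 validation splits, SAM is prompted with the bounding boxes output by a fully supervised ViTDet-H. To produce the final prediction, an additional mask refinement iteration is applied: the most confident predicted mask is fed back into the mask decoder together with the box prompt.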
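
The refinement step can be sketched as a small control loop. The `decoder(box, mask_input)` callable and the toy decoder below are hypothetical stand-ins that return `(mask, score)` pairs; they are not the paper's code. In the official repository, `SamPredictor.predict` accepts `box` and `mask_input` arguments that appear to play these roles.

```python
def segment_with_refinement(decoder, box, num_refinements=1):
    """Prompt with a box, then feed the best mask back in with the same box.

    `decoder(box, mask_input)` is a hypothetical callable returning a list
    of (mask, score) candidates.
    """
    # First pass: box prompt only, keep the most confident candidate.
    mask, score = max(decoder(box, None), key=lambda c: c[1])
    for _ in range(num_refinements):
        # Refinement pass: same box plus the previous best mask.
        mask, score = max(decoder(box, mask), key=lambda c: c[1])
    return mask, score


calls = []


def toy_decoder(box, mask_input):
    """Stand-in decoder that records its inputs and returns labelled strings."""
    calls.append((box, mask_input))
    if mask_input is None:
        return [("coarse-a", 0.6), ("coarse-b", 0.8)]
    return [(f"refined({mask_input})", 0.9)]


detector_box = (10, 20, 110, 220)  # made-up box standing in for a detector output
final_mask, final_score = segment_with_refinement(toy_decoder, detector_box)
print(final_mask, final_score)
```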

Discussion

In the future, SAM could be used to help power applications in numerous domains that require finding and segmenting any object in any image. For the AI research community and others, SAM could become a component in larger AI systems for more general multimodal understanding of the world, for example, understanding both the visual and text content of a webpage. In the AR/VR domain, SAM could enable selecting an object based on a user’s gaze and then “lifting” it into 3D. For content creators, SAM can improve creative applications such as extracting image regions for collages or video editing. SAM could also be used to aid scientific study of natural occurrences on Earth or even in space, for example, by localizing animals or objects to study and track in video. We believe the possibilities are broad, and we are excited by the many potential use cases we haven’t even imagined yet.
