Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

official repo: https://github.com/IDEA-Research/ED-Pose

Abstract

ED-Pose stands for Explicit box Detection for multi-person Pose estimation. The keyword to pay attention to is "explicit". The following are the main points the paper puts forward.

- Human detection decoder
- Contextual learning between keypoints, used to learn box positions and contents
- Human-to-keypoint detection, where interactive learning lets global and local features aggregate well

In summary, ED-Pose works without post-processing or dense heatmaps.

Introduction

As Figure 1 of the paper shows, pose estimation involves both global (human-level) and local (keypoint-level) dependencies. For this reason, prior work tends to split the task into two separate subtasks and solve them in two stages (e.g. global person detection and local keypoint regression).

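These two-stage methods are divided into Top-Down (TD) and Bottom-Up (BU) approaches.
- Top-Down (TD): good accuracy, but a high inference cost.
- Bottom-Up (BU): fast inference, but lower accuracy.
Two-stage approaches are hard to train end-to-end, because merging the outputs of the stages into one result involves non-differentiable operations.
ED-Pose takes its inspiration from recent transformer-based detection work that has no post-processing at all, such as PETR.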

Rethinking one-stage multi-person pose estimation

The Necessities of One-stage Methods

TD detects and crops each person with an object detector at the global level, then estimates keypoints at the local level. This approach has the following problems.

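- Strong dependence on the detector.
- The cost of the detector and its post-processing logic (RoI, NMS, etc.).
- The detector and the pose estimator are trained independently.

BU, in contrast, first detects every keypoint it can and then uses a grouping algorithm to link the keypoints that belong to the same person.

- Under heavy occlusion (especially with multiple people), the grouping algorithm may not work well.

Both approaches also share the drawback of being non-differentiable.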
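To make the post-processing cost in a TD pipeline concrete, here is a minimal sketch of greedy NMS in plain Python. It is an illustration only, not code from the ED-Pose repository: the function names, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.5 are assumptions. The hard keep-or-drop decision in the loop is the kind of step that gradients cannot flow through.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already kept box by more than
    iou_threshold (hypothetical default, chosen for illustration).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Hard, discrete decision: this is the non-differentiable part.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping person boxes and one separate box.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
kept = greedy_nms(example_boxes, example_scores)
```

In the example, the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes survive.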

One-stage methods can mitigate all of the drawbacks above and have the advantage of being optimizable end-to-end. Recent DETR-inspired methods have tried to solve the task in one stage, but at the cost of a serious drop in performance.

The Bottlenecks of Existing One-stage Methods

Among DETR-based methods, most work still follows the TD framework, and improves the keypoint results by adding sequential information to the second-stage human pose estimation. PETR makes the whole pipeline end-to-end without any post-processing. However, this approach still has the following problems.

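- Regressing keypoints from local information alone can yield semantically ambiguous features.
- Training is slow when the queries (this is a transformer-based model) start from randomly initialized keypoints without using image features.
- A keypoint represented as a single point carries too little information when querying the encoded features, so it can end up misaligned.
- Expressing the global-to-global, global-to-local, and local-to-local relations is quite complex.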

Methodology

Overview

input: image
F: tokenize(backbone(input))
PE: positional embeddings
Encoder: the transformer encoder
- input: F
- output: F'
Human detection decoder (coarse human query selection): produces Q^c'_H, Q^p'_H from Q^c_H, Q^p_H (described below)
- inputs
  - Q^c_H: human content queries, coarsely selected from F'
  - Q^p_H: human position queries, generated from Q^c_H through an FFN
- outputs
  - Q^c'_H
  - Q^p'_H
- the box regression and classification losses (L_h, L_c) are computed from these outputs
- Fine human query selection
  - discards unneeded human queries to obtain Q^c_{H_s}, Q^p_{H_s}
- Human-to-Keypoint query expansion
  - extends the human information to keypoint information
  - expands Q^c_{H_s}, Q^p_{H_s} into Q^c_{H,K}, Q^p_{H,K}
Human-to-Keypoint detection decoder
- inputs
  - Q^c_{H,K}: expanded content queries for the keypoints
  - Q^p_{H,K}: expanded position queries for the keypoints
- outputs
  - Q^c'_{H,K}
  - Q^p'_{H,K}
- the keypoint, box, and class losses are computed from these results
Loss: the keypoint loss consists of L1 and OKS terms (the other losses are not covered here)
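The keypoint loss mentioned in the overview combines an L1 term with an OKS term. The sketch below shows one way such a loss could be computed, using the standard COCO definition of OKS. The function names, the `1 - OKS` form, the weights `w_l1` and `w_oks`, and the per-keypoint constants `kappas` are assumptions for illustration, not values taken from the paper or the official repo.

```python
import math


def l1_keypoint_loss(pred, target, visible):
    """Mean L1 distance over visible keypoints; pred/target are lists of (x, y)."""
    total, count = 0.0, 0
    for (px, py), (tx, ty), v in zip(pred, target, visible):
        if v:
            total += abs(px - tx) + abs(py - ty)
            count += 1
    return total / count if count else 0.0


def oks(pred, target, visible, area, kappas):
    """COCO-style Object Keypoint Similarity.

    Each visible keypoint contributes exp(-d^2 / (2 * area * k^2)), where d is
    the Euclidean distance, area is the object area (scale squared), and k is
    a per-keypoint constant.
    """
    sims = []
    for (px, py), (tx, ty), v, k in zip(pred, target, visible, kappas):
        if v:
            d2 = (px - tx) ** 2 + (py - ty) ** 2
            sims.append(math.exp(-d2 / (2.0 * area * k * k)))
    return sum(sims) / len(sims) if sims else 0.0


def keypoint_loss(pred, target, visible, area, kappas, w_l1=1.0, w_oks=1.0):
    """Weighted sum of the L1 term and (1 - OKS); the weights are hypothetical."""
    l1 = l1_keypoint_loss(pred, target, visible)
    oks_term = 1.0 - oks(pred, target, visible, area, kappas)
    return w_l1 * l1 + w_oks * oks_term


# Toy example with three keypoints, the last one not visible.
gt = [(10.0, 10.0), (20.0, 15.0), (0.0, 0.0)]
vis = [1, 1, 0]
toy_kappas = [0.1, 0.1, 0.1]  # placeholder constants, not the COCO values
perfect = keypoint_loss(gt, gt, vis, area=400.0, kappas=toy_kappas)
shifted = keypoint_loss([(12.0, 10.0), (20.0, 15.0), (5.0, 5.0)], gt, vis,
                        area=400.0, kappas=toy_kappas)
```

The L1 term gives a plain coordinate error, while the OKS term normalizes the error by object size, so the same pixel offset costs more on a small person than on a large one.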

Human Detection Decoder

Predicts the bounding box of each person.
- inputs
  - Q^c_H: human content queries (N×D)
  - Q^p_H: human position queries (N×4)
- outputs
  - Q^c'_H: refined human content representations
  - Q^p'_H: refined human box positions
The inputs go through human-to-human attention (self-attention), followed by multi-scale cross-attention layers (idea from DETR).
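The human-to-human attention step can be pictured as ordinary self-attention over the N human content queries. The sketch below is a deliberately simplified version in plain Python: it uses a single head and identity projections (no learned weights), and it leaves out the cross-attention to image features entirely. These simplifications and the function names are assumptions for illustration, not the decoder used in ED-Pose.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def human_self_attention(queries):
    """Single-head scaled dot-product self-attention over N query vectors.

    queries: list of N lists of length D (the human content queries).
    Returns (outputs, weights): outputs is N x D, weights is N x N.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        # Similarity of this human query to every human query (itself included).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in queries]
        weights.append(softmax(scores))
    # Each output is a weighted average of all query vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, queries)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Three toy human queries with D = 2.
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = human_self_attention(toy_queries)
```

Every output row mixes information from all human queries, which is how each person's representation can take the other detected people into account before its box is refined.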

Human-to-Keypoint Detection Decoder

Multi-person pose estimation is recast as multiple-set keypoint box detection problems.
Q^c_{H_s}, Q^p_{H_s} are expanded into Q^c_{H,K}, Q^p_{H,K}.
- A keypoint embedding V_e (1×K×D) is used to expand them into Q^c_{H,K}, Q^p_{H,K}
- Shapes: M×1×D, M×1×4 -> (M+M*K)×D, (M+M*K)×4
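The shape bookkeeping of the query expansion can be checked with a small sketch. It assumes that fine human query selection keeps the M highest-scoring queries and that keypoint content queries are formed by adding a keypoint embedding to the human content query. The keypoint position queries here are simply copies of the human box, which is a placeholder and not the paper's initialization. All function names are hypothetical.

```python
def select_top_humans(scores, m):
    """Indices of the m highest-scoring human queries (assumed selection rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:m]


def expand_queries(human_content, human_boxes, keypoint_embed):
    """Expand M human queries into M + M*K human and keypoint queries.

    human_content:  M x D content queries of the selected humans
    human_boxes:    M x 4 position queries of the selected humans
    keypoint_embed: K x D keypoint embedding, shared by all humans
    Returns (content, position) with shapes (M + M*K) x D and (M + M*K) x 4,
    laid out per person as [human, keypoint 1, ..., keypoint K].
    """
    content, position = [], []
    for c, box in zip(human_content, human_boxes):
        content.append(list(c))
        position.append(list(box))
        for e in keypoint_embed:
            # Keypoint content query: human content plus keypoint embedding.
            content.append([ci + ei for ci, ei in zip(c, e)])
            # Placeholder: start each keypoint box from the human box.
            position.append(list(box))
    return content, position


# Toy example: 3 candidate humans, keep M = 2, with K = 3 keypoints and D = 2.
scores = [0.1, 0.9, 0.5]
all_content = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
all_boxes = [[0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.2, 0.4], [0.3, 0.3, 0.1, 0.1]]
embed = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

idx = select_top_humans(scores, 2)
content, position = expand_queries(
    [all_content[i] for i in idx], [all_boxes[i] for i in idx], embed
)
```

With M = 2 and K = 3, both returned lists have 2 + 2*3 = 8 rows, matching the (M+M*K)×D and (M+M*K)×4 shapes above.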

Experiments

Datasets: CrowdPose, COCO

Jungduri / MLPaperReivew