chaos-moon / paper_daily

One paper a day, keep laziness away.
MIT License

[paper] Modality-Agnostic Debiasing for Single Domain Generalization #21

Open yaoyz96 opened 1 year ago


MAD (CVPR 2023)

Modality-Agnostic Debiasing for Single Domain Generalization, CVPR 2023. [paper]

Institution & Author

Preliminaries

Motivation


Most existing SDG methods design data-augmentation schemes that expand the single source domain into diverse domains to improve generalization. The authors argue that these methods are modality-specific and apply mostly to images: they focus on augmenting a single image domain into multiple image domains. Point-cloud data, however, follows different domain-shift rules (image domain shift mostly involves texture and structure, whereas point-cloud domain shift involves geometric structure and position), so these image augmentations do not transfer to point clouds. The authors therefore tackle SDG from the perspective of network design, yielding an SDG method that is agnostic to the data modality.

Our motivation is straightforward: since the vanilla classifier trained with SGD will inadvertently focus more on those domain-specific features, the weights of the trained classifier can be considered as an indicator of those features. (The authors' idea: since the classifier inevitably attends to domain-specific features, a classifier that has absorbed this bias can itself serve as an indicator of the bias. Moreover, a single classifier cannot locate all biased features, so they propose combining multiple classifiers into a branch dedicated to learning the bias.)
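A minimal numpy sketch of this idea, under my own assumptions about shapes and aggregation (the paper does not prescribe this exact computation): each of K linear heads in the biased branch may latch onto a different domain-specific factor, and aggregating their per-dimension weight energy gives a broader bias indicator than any single head.

```python
import numpy as np

rng = np.random.default_rng(0)

D, C, K = 16, 4, 3                    # feature dim, classes, number of biased heads
features = rng.normal(size=(8, D))    # a toy batch of extracted features

# Biased branch: K independent linear classifier heads; each may rely on a
# different domain-specific factor (background, texture, high-frequency noise).
heads = [rng.normal(size=(D, C)) for _ in range(K)]

# A head's per-dimension weight magnitude hints at which feature dimensions it
# relies on; averaging over all K heads yields a combined bias indicator.
indicator = np.mean([np.abs(W).sum(axis=1) for W in heads], axis=0)  # shape (D,)

# The dimensions with the largest indicator values are candidate
# domain-specific directions for the general branch to suppress.
top_biased = np.argsort(indicator)[-3:]
```

The multiple heads matter because, as noted below, no single classifier captures every biased factor.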

Contribution

Method


For images, there are several factors typically correlated to domain-specific features, such as the background contexts [1], the texture of the objects [19], and high-frequency patterns that are almost invisible to the human eye [55]. (Domain-specific features in images stem from multiple factors, including background, texture, and high-frequency signals invisible to the human eye, so the authors argue that a single classifier is insufficient to indicate all the bias.)

Questions:

Based on the proposed biased-branch, we have an indicator to those domain-specific features. A follow-up question is how to suppress those domain-specific features in favor of focusing more on those desired domain-generalized features. (Once the multi-head classifier is trained, it serves as an indicator of domain-specific features; the next step is to remove this bias.)

However, if we optimize the whole network (including the feature extractor $f$, biased-branch classifier $g_{bias}$, and general-branch classifier $g_{gen}$) simultaneously at the beginning, there is no guarantee that the classifier $g_{gen}$ will pay more attention to those domain-general features. To address this issue, we introduce a two-stage learning mechanism to enable the interaction between the two branches. (If the three losses are jointly optimized from the start, there is no guarantee that $g_{gen}$ will attend more to domain-generalized features, so the authors propose a two-stage training strategy.)
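The two-stage flow can be sketched as follows. This is a hypothetical simplification, not the authors' exact losses: stage 1 trains the biased branch alone and reads off the dimensions it relies on; stage 2 fits the general branch on features with those dimensions suppressed.

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, N = 16, 4, 32
X = rng.normal(size=(N, D))                   # extracted features (toy stand-in)
Y = np.eye(C)[rng.integers(0, C, size=N)]     # one-hot labels

# Stage 1 (sketch): train only the biased-branch head, then score each feature
# dimension by the trained head's weight magnitude.
W_bias = rng.normal(size=(D, C))              # stand-in for a trained biased head
bias_score = np.abs(W_bias).sum(axis=1)
keep = bias_score < np.percentile(bias_score, 75)  # drop the most-biased 25% of dims

# Stage 2: fit the general-branch head on features with the biased dimensions
# zeroed out, forcing it toward domain-generalized directions.
X_debiased = X * keep
W_gen, *_ = np.linalg.lstsq(X_debiased, Y, rcond=None)
```

Separating the stages is what gives the guarantee missing from joint optimization: the general branch only ever sees features with the indicated bias already suppressed.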

Experiments

As pointed out in [55], the low-frequency component (LFC) is much more generalizable than the high-frequency component (HFC), i.e., LFC typically represents those domain-generalized (semantic) features, and HFC denotes those domain-specific (superficial) features. (Low frequencies typically correspond to domain-generalized features; high frequencies to domain-specific features.)
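The LFC/HFC split above can be made concrete with a radial mask in Fourier space. A minimal sketch; the cutoff radius `r = 8` is an arbitrary assumption, not a value from [55]:

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.normal(size=(32, 32))              # a toy grayscale image

# Shift the 2D spectrum so the DC component sits at the center (16, 16).
F = np.fft.fftshift(np.fft.fft2(img))
yy, xx = np.mgrid[:32, :32]
dist = np.hypot(yy - 16, xx - 16)
low_mask = dist <= 8                         # cutoff radius r = 8 (assumed)

# LFC: invert only the low frequencies; HFC: invert the complement.
lfc = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real    # "semantic" part
hfc = np.fft.ifft2(np.fft.ifftshift(F * ~low_mask)).real   # "superficial" part
# Because the two masks partition the spectrum, lfc + hfc reconstructs img.
```

Training a model on `lfc` alone is one way to probe how much it depends on the superficial high-frequency cues.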