UNETR, ViT as Encoder, CNN as Decoder
UNETR consists of a transformer encoder that operates directly on 3D patches and is connected to a CNN-based decoder via skip connections.
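To make the "directly utilizes 3D patches" part concrete, here is a minimal NumPy sketch (my own simplification, not the authors' code; the 16×16×16 patch size matches the paper, but the 96³ input size and all names are illustrative) of how a volume is carved into non-overlapping 3D patches and flattened into the token sequence a ViT-style encoder embeds:

```python
import numpy as np

def patchify_3d(volume, patch=16):
    """Split a (H, W, D) volume into flattened non-overlapping 3D patches.

    Returns an array of shape (num_patches, patch**3): the token sequence
    that a ViT-style encoder would linearly project into its embedding space.
    """
    H, W, D = volume.shape
    assert H % patch == 0 and W % patch == 0 and D % patch == 0
    # Carve the volume into a grid of patch-sized blocks.
    grid = volume.reshape(H // patch, patch, W // patch, patch, D // patch, patch)
    # Bring the three block indices to the front, then flatten each block.
    blocks = grid.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(-1, patch ** 3)

vol = np.random.rand(96, 96, 96)
tokens = patchify_3d(vol)
print(tokens.shape)  # (216, 4096): (96/16)**3 patches, 16**3 voxels each
```

This is the 3D analogue of ViT's 2D patch tokenization; the sequence length grows with the cube of the inverse patch size, which is why patch resolution matters so much for memory (see the ablation below).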
UNETR: Transformers for 3D Medical Image Segmentation, by NVIDIA and Vanderbilt University, 2022 WACV, over 340 citations. Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net, Transformer, Vision Transformer, ViT
Biomedical Image Segmentation 2015 … 2021 [Expanded U-Net] [3-D RU-Net] [nnU-Net] [TransUNet] [CoTr] [TransBTS] [Swin-Unet] ==== My Other Paper Readings Also Over Here ====
At each resolution, the reshaped tensors are projected from the embedding space back into the input space using consecutive $3\times 3\times 3$ convolutional layers, each followed by batch normalization.
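The "reshaping" step can be sketched as follows (a NumPy illustration under my own assumptions: a 6×6×6 token grid from a 96³ input with 16³ patches, and a 768-dimensional embedding as in ViT-B; the subsequent 3×3×3 convolutions and batch norm are only noted, not implemented):

```python
import numpy as np

def seq_to_volume(tokens, grid=(6, 6, 6)):
    """Reshape a transformer token sequence (L, K) into a (K, h, w, d) feature map.

    L must equal h*w*d. The channel axis is moved first so the result is in the
    layout that 3x3x3 convolutions (followed by batch norm, as in UNETR's
    decoder path) expect.
    """
    L, K = tokens.shape
    h, w, d = grid
    assert L == h * w * d
    return tokens.reshape(h, w, d, K).transpose(3, 0, 1, 2)

tokens = np.random.rand(216, 768)  # 216 tokens, embedding dim 768 (assumed)
fmap = seq_to_volume(tokens)
print(fmap.shape)  # (768, 6, 6, 6)
```

After this reshape, each transformer feature sequence behaves like an ordinary low-resolution 3D feature map that the CNN decoder can fuse through skip connections.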
The loss function is a combination of soft dice loss and cross-entropy loss:
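A minimal sketch of such a combined loss, in NumPy (my own formulation: the equal 1:1 weighting of the two terms and the exact Dice normalization are assumptions; the paper's equation may differ in detail):

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    """Combined soft-Dice + cross-entropy loss (hedged sketch).

    probs:  (C, N) softmax probabilities over C classes for N voxels.
    target: (C, N) one-hot ground-truth labels.
    """
    # Soft Dice term, averaged over classes.
    inter = (probs * target).sum(axis=1)
    denom = probs.sum(axis=1) + target.sum(axis=1)
    dice = 1.0 - (2.0 * inter / (denom + eps)).mean()
    # Voxel-wise cross-entropy term.
    ce = -(target * np.log(probs + eps)).sum(axis=0).mean()
    return dice + ce

# Near-perfect prediction: both terms approach zero.
t = np.eye(3)[:, [0, 1, 2, 0]]         # one-hot targets for 4 voxels, 3 classes
p = np.clip(t, 1e-4, 1 - 1e-4)
p = p / p.sum(axis=0)                  # renormalize to valid probabilities
print(dice_ce_loss(p, t))
```

The Dice term counteracts class imbalance (small organs contribute as much as large ones per class), while cross-entropy provides smooth per-voxel gradients.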
Quantitative comparison of segmentation performance on the BTCV test set. The top and bottom sections show the Standard and Free Competition benchmarks, respectively.
Qualitative comparison of different baselines on BTCV cross-validation.
Quantitative comparison of segmentation performance on the brain tumor and spleen segmentation tasks of the MSD dataset.
Effect of the decoder architecture on segmentation performance. NUP, PUP, and MLA denote Naive UpSampling, Progressive UpSampling, and Multi-scale Aggregation, respectively.
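The difference between the NUP and PUP decoder variants can be sketched shape-wise (a NumPy illustration under my own assumptions: nearest-neighbour upsampling stands in for learned deconvolutions, the convolutions between steps are omitted, and the 6³-to-96³ resolutions are illustrative):

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, h, w, d) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2).repeat(2, axis=3)

def naive_upsample(fmap, target=96):
    """NUP-style: jump from bottleneck to full resolution in a single step."""
    factor = target // fmap.shape[1]
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2).repeat(factor, axis=3)

def progressive_upsample(fmap, target=96):
    """PUP-style: repeated 2x upsampling stages (intermediate convs omitted)."""
    while fmap.shape[1] < target:
        fmap = upsample2x(fmap)
    return fmap

bottleneck = np.random.rand(32, 6, 6, 6)
print(naive_upsample(bottleneck).shape)        # (32, 96, 96, 96)
print(progressive_upsample(bottleneck).shape)  # (32, 96, 96, 96)
```

Both reach the same output resolution; the ablation compares how much the intermediate processing (and, in UNETR's full decoder, the multi-resolution skip connections) contributes.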
Effect of patch resolution on segmentation performance.
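Why patch resolution matters: halving the patch size increases the token count eightfold, and self-attention cost scales quadratically with sequence length. A quick check (assuming a 96³ input crop; the exact input size used in each experiment is not restated here):

```python
def seq_len(vol=96, patch=16):
    """Number of transformer tokens for a cubic volume and cubic patch size."""
    return (vol // patch) ** 3

print(seq_len(patch=16))  # 216 tokens
print(seq_len(patch=32))  # 27 tokens
# Self-attention cost ratio ~ (216/27)**2 = 64x between the two settings.
```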
Comparison of the number of parameters, FLOPs, and average inference time for various models in the BTCV experiments.
Sik-Ho Tsang. Review — UNETR: Transformers for 3D Medical Image Segmentation.