Hi @davidiommi, thanks for your interest here. @ahatamiz, could you please share some comments about this question?
Thanks in advance.
Hi @davidiommi,
Thank you for your interest in our work. This is a great question regarding the different hyper-parameters used in UNETR. Let me first provide a short answer: feature_size and pos_embed are the parameters that need to be changed to adapt it to your application of interest. The other parameters mentioned come from the Vision Transformer (ViT) default hyper-parameters (original architecture). In addition, the new revision of the UNETR paper with more detailed descriptions is now publicly available. Please check it for more details:
https://arxiv.org/pdf/2103.10504.pdf
Now let's look at each of these hyper-parameters in order of importance:
feature_size: in UNETR, we multiply the size of the CNN-based features in the decoder by a factor of 2 at every resolution (just like the original UNet paper). By default, we set this value to 16 (to make the entire network lighter). However, using larger values such as 32 can improve segmentation performance if GPU memory is not an issue. Figure 2 of the paper also shows this in detail.
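A tiny sketch of the doubling described above, assuming four decoder resolutions purely for illustration (the exact counts follow the paper's Figure 2):

```python
# Decoder feature widths at each resolution for the default and a larger feature_size.
for feature_size in (16, 32):
    widths = [feature_size * 2 ** i for i in range(4)]
    print(feature_size, "->", widths)  # 16 -> [16, 32, 64, 128], 32 -> [32, 64, 128, 256]
```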
pos_embed: this determines how the image is divided into non-overlapping patches. Essentially, there are two ways to achieve this (by setting it to conv or perceptron). Let's dive further into it for more information:
The first is to directly apply a convolutional layer with the same stride and kernel size as the patch size, and with a feature size equal to the hidden size of the ViT model. The second is to first break the image into patches by properly resizing the tensor (for which we use einops) and then feed them into a perceptron (linear) layer with the hidden size of the ViT model. Our experiments show that for certain applications, such as brain segmentation with multiple modalities (e.g. 4 modes such as T1, T2, etc.), using the convolutional layer works better, as it takes all modes into account concurrently. For CT images (e.g. BTCV multi-organ segmentation), we did not see any difference in performance between these two approaches.
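A minimal sketch of the two embedding options, assuming an example 4-modality 96³ volume, a patch size of 16, and the ViT hidden size of 768 (these values are illustrative, not fixed by UNETR):

```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn(1, 4, 96, 96, 96)  # (batch, modalities, D, H, W)
patch, hidden = 16, 768

# Option 1: pos_embed="conv" -- a conv layer whose kernel and stride equal the
# patch size projects each non-overlapping patch to the hidden size.
conv_embed = nn.Conv3d(4, hidden, kernel_size=patch, stride=patch)
tokens_conv = conv_embed(x).flatten(2).transpose(1, 2)      # (1, 216, 768)

# Option 2: pos_embed="perceptron" -- reshape into flattened patches with einops,
# then project them with a linear (perceptron) layer.
patches = rearrange(
    x, "b c (d p1) (h p2) (w p3) -> b (d h w) (p1 p2 p3 c)",
    p1=patch, p2=patch, p3=patch,
)                                                            # (1, 216, 16384)
linear_embed = nn.Linear(patch ** 3 * 4, hidden)
tokens_linear = linear_embed(patches)                        # (1, 216, 768)

print(tokens_conv.shape, tokens_linear.shape)
```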
hidden_size: this is the size of the hidden layers in the ViT encoder. We follow the original ViT model and set this value to 768. Note that the hidden size should be divisible by the number of attention heads in the ViT model.
num_heads: the number of attention heads in the multi-headed self-attention block. Following the ViT architecture, we set it to 12.
mlp_dim: this is the dimension of the multi-layer perceptrons (MLP) in the transformer encoder. Again, we follow the ViT model and set this to 3072 as the default value to be consistent with their architecture.
Lastly, I recommend checking the UNETR repository, which contains the code for reproducing the BTCV experiments: https://monai.io/research/unetr
I hope these descriptions were helpful.
Thanks for your reply.
I am testing the network on other segmentation tasks (prostate), but performance is worse than DynUNet/nnU-Net. After reading more about vision transformers, I understand that I would need much more data compared to CNNs.
We will see in the future whether UNETR will outperform CNNs on medical images.
Thanks again.
Hi @davidiommi
Great. I think it may also need more tuning/training for your specific task. On task 6 of the MSD dataset (prostate), we have already outperformed nnU-Net on our internal validation folds.
Thanks.
First of all: nice work on UNETR, and all the best for the review process.
I have one question: are there any guidelines on how to tune the parameters of the network depending on the dataset we are working on?
feature_size (int) – dimension of network feature size.
hidden_size (int) – dimension of hidden layer.
mlp_dim (int) – dimension of feedforward layer.
num_heads (int) – number of attention heads.
pos_embed (str) – position embedding layer type.
Thanks in advance