The official implementation of the paper "Towards Temporally Consistent Referring Video Object Segmentation":
Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks.
https://github.com/bo-miao/HTR/assets/53172019/7b2e7d56-59f8-4ba2-b502-c4e7ed9e0417
Please refer to SgMg for installation and data preparation.
The checkpoint for HTR w/ SwinL is available at HTR-SwinL.
To evaluate HTR on Ref-DAVIS/YouTube-VOS, run the following commands in the scripts folder:
sh dist_test_davis_swinl.sh
sh dist_test_ytv_swinl.sh
The code for MCS evaluation is in get_mcs.py.
On the Ref-YTVOS eval server, click View scoring output log to download the stdout.txt of your submission.
Then you can run the script to get the MCS score under different thresholds.
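To give an intuition for what a threshold-based consistency score can look like, here is a minimal sketch. It is not the code in get_mcs.py and does not parse stdout.txt. It assumes the score at a threshold is the fraction of sequences whose mask IoU stays above that threshold on every frame; the function names, the threshold values, and the toy IoU values are all hypothetical.

```python
from typing import Dict, List, Sequence


def consistency_score(per_frame_iou: Dict[str, List[float]], threshold: float) -> float:
    """Fraction of sequences whose IoU exceeds `threshold` on every frame.

    Assumed definition for illustration only; see get_mcs.py for the real one.
    """
    if not per_frame_iou:
        raise ValueError("no sequences given")
    # A sequence counts as consistent only if its worst frame clears the threshold.
    consistent = sum(
        1 for ious in per_frame_iou.values() if ious and min(ious) > threshold
    )
    return consistent / len(per_frame_iou)


def score_table(
    per_frame_iou: Dict[str, List[float]],
    thresholds: Sequence[float] = (0.1, 0.3, 0.5, 0.7, 0.9),  # hypothetical values
) -> Dict[float, float]:
    """Evaluate the assumed score at several thresholds."""
    return {t: consistency_score(per_frame_iou, t) for t in thresholds}


# Hypothetical per-frame IoUs for three sequences (not real results).
toy = {
    "seq_a": [0.92, 0.88, 0.95],  # stable, high quality
    "seq_b": [0.81, 0.35, 0.79],  # one bad frame breaks consistency
    "seq_c": [0.60, 0.62, 0.58],  # stable, moderate quality
}

if __name__ == "__main__":
    for t, score in score_table(toy).items():
        print(f"threshold {t:.1f}: {score:.3f}")
```

Under this assumed definition, a single poorly segmented frame (as in seq_b) removes the whole sequence from the count, which is why such a score is more sensitive to temporal inconsistency than a per-frame average.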
@article{miao2024htr,
title={Towards Temporally Consistent Referring Video Object Segmentation},
author={Miao, Bo and Bennamoun, Mohammed and Gao, Yongsheng and Shah, Mubarak and Mian, Ajmal},
journal={https://arxiv.org/abs/2403.19407},
year={2024}
}
If you have any questions about this project, please feel free to contact bomiaobbb@gmail.com.