by Long Lian, Zhirong Wu, and Stella X. Yu at UC Berkeley, MSRA, and UMich
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[Paper] | [Project Page] | [Presentation Video] | [Demo Video] | [Poster] | [Citation]
This GIF has been significantly compressed. Check out our video at full resolution here. Inference in this demo is done per-frame without post-processing for temporal consistency.
If you want to qualitatively compare segmentation masks from our method without running our code, you can download the segmentation masks here.
- Download DAVIS 2016 and unzip it to `data/data_davis`.
- Download pre-extracted flow from RAFT (trained with chairs and things) and decompress it to `data/data_davis`.
- Download the DenseCL ResNet50 weights to `data/pretrained/densecl_r50_imagenet_200ep.pth`.
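After these steps, the data directory should look roughly like the sketch below. The DAVIS subdirectories follow the official DAVIS 2016 layout; the exact name of the flow directory depends on the pre-extracted RAFT archive and is an assumption here:

```sh
# Expected layout (a sketch):
# data/
#   data_davis/
#     JPEGImages/       # DAVIS 2016 frames
#     Annotations/      # DAVIS 2016 ground-truth masks
#     ...               # pre-extracted RAFT flow, decompressed alongside
#   pretrained/
#     densecl_r50_imagenet_200ep.pth
ls data/data_davis data/pretrained
```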
The `requirements.txt` assumes CUDA 11.x. You can also install torch and torchvision from conda instead of pip. `torchCRF` is a GPU CRF implementation that is typically faster than a CPU implementation. If you plan to use this implementation in your work, see `tools/torchCRF/README.md` for the license.
pip install -r requirements.txt
cd tools/torchCRF
python setup.py install
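To confirm the build succeeded, a quick import check (this assumes the package installs under the module name `torchCRF`; adjust if your build registers a different name):

```sh
# Verify that the compiled CRF extension imports cleanly.
python -c "import torchCRF; print('torchCRF OK')"
```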
We also require the `parallel` command from moreutils. If your `parallel` does not work (for example, the `parallel` from the GNU `parallel` package), install moreutils either from your system package manager (e.g., APT on Ubuntu/Debian) or from conda: `conda install -c conda-forge moreutils`.
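To verify that the `parallel` on your PATH comes from moreutils, a Debian/Ubuntu-specific check (other distributions have equivalent package-manager queries):

```sh
# Show which package owns the `parallel` binary; it should be moreutils.
dpkg -S "$(command -v parallel)"
```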
We provide pretrained models and prediction masks. If you intend to work on a custom dataset that is out-of-distribution with respect to our training data (e.g., DAVIS16), we suggest training/fine-tuning our model on the new dataset.
Name | Dataset | Backbone | mIoU (w/o post-proc.) | mIoU (w/ post-proc.) | Model | Masks |
---|---|---|---|---|---|---|
RCF (All stages) | DAVIS16 | ResNet50 | 80.9 | 83.0 | Download | Download |
RCF (Stage 1 only) | DAVIS16 | ResNet50 | 78.9 | 81.4 | Download | Download |
RCF (All stages) | SegTrackv2 | ResNet50 | 76.7 | 79.6 | Download | Download |
RCF (Stage 1 only) | SegTrackv2 | ResNet50 | 72.8 | 77.6 | Download | Download |
RCF (All stages) | FBMS59 | ResNet50 | 69.9 | 72.4 | Download | Download |
RCF (Stage 1 only) | FBMS59 | ResNet50 | 66.8 | 69.1 | Download | Download |
To evaluate a pretrained model with our main training script (unofficial evaluation), or to evaluate the downloaded masks with the evaluation tools, use `--test-override-pretrained` and `--test-override-object-channel` to specify the model path and the object channel, respectively.
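For example, a sketch of evaluating a downloaded checkpoint; the checkpoint path and object channel below are placeholders, so substitute the path you saved the download to and the channel that matches that model:

```sh
# Hypothetical checkpoint path and object channel; replace both with the
# values for your downloaded model.
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf/rcf_eval.yaml --test \
  --test-override-pretrained saved/rcf_all_stages_davis16.ckpt \
  --test-override-object-channel 0
```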
To train our model on DAVIS16 with 2 GPUs, run:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage1.yaml
This should lead to a model with mIoU around 78% to 79% on DAVIS16 (without post-processing). Run stage 2 as well if additional gains are desired. To run with a different number of GPUs, change `CUDA_VISIBLE_DEVICES` and `batch_size` so that the total batch size (`batch_size` times the number of GPUs) matches the intended batch size (16 in this config); see the sketch below.
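For example, a 4-GPU run keeping the total batch size at 16. The `--opts batch_size 4` override is an assumption modeled on the `--opts object_channel` pattern used later in this README; alternatively, edit `batch_size` in the config file:

```sh
# 4 GPUs x batch_size 4 = total batch size 16.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --master_addr 127.0.0.1 \
  --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage1.yaml \
  --opts batch_size 4
```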
This stage uses a Conditional Random Field (CRF) to obtain training signals based on low-level vision (e.g., color). Before running this stage, we need to obtain the object channel through motion-appearance alignment (MAA):
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/maa.py --pretrain_dir saved/saved_rcf_stage1 --first-frames-only --step 43200
OBJECT_CHANNEL=$?
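Note that the MAA script communicates the selected object channel through its exit status, so `$?` must be captured immediately after the command. A quick sanity check:

```sh
# Print the captured channel before using it downstream; if the script
# failed, $? would hold an error code instead of a valid channel index.
echo "Selected object channel: $OBJECT_CHANNEL"
```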
Then we can run training (which continues from the pretrained stage 1 model):
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage2.1.yaml --opts object_channel $OBJECT_CHANNEL
This stage uses a pretrained ViT model from DINO to obtain training signals based on high-level vision (e.g., semantics discovered through unsupervised learning). Semantic constraints are enforced offline because this step is slow.
# export the predictions on trainval
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf/rcf_export_trainval_ema.yaml --test --test-override-pretrained saved/saved_rcf_stage2.1/last.ckpt --opts checkpoints_dir saved/saved_rcf_stage2.1 object_channel $OBJECT_CHANNEL
# run semantic constraints
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/semantic_constraints.py --pretrain_dir saved/saved_rcf_stage2.1 --object-channel $OBJECT_CHANNEL
# training with semantic constraints
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage2.2.yaml --opts object_channel $OBJECT_CHANNEL train_dataset_kwargs.pl_root saved/saved_rcf_stage2.1/saved_eval_export_trainval_ema_torchcrf_ncut_torchcrf/$OBJECT_CHANNEL
This should give you an mIoU of 80% to 81% (without post-processing).
To evaluate a trained model unofficially (with our main script), run:
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf/rcf_eval.yaml --test --test-override-pretrained saved/saved_rcf_stage2.2/last.ckpt --test-override-object-channel $OBJECT_CHANNEL
We encourage evaluating the model with our evaluation tool, which is designed to closely match the official DAVIS 2016 evaluation tool. To evaluate a trained model with the evaluation tool on the exported masks (stage 2.2 exports masks on the validation set by default):
python tools/davis2016-evaluation/evaluation_method.py --task unsupervised --davis_path data/data_davis --year 2016 --step 4320 --results_path saved/saved_rcf_stage2.2/saved_eval_export
To refine the exported masks from a trained model with CRF post-processing, run:
sh tools/pydenseCRF/crf_parallel.sh
Then evaluate the refined masks with evaluation tool:
python tools/davis2016-evaluation/evaluation_method.py --task unsupervised --davis_path data/data_davis --year 2016 --step 4320 --results_path saved/saved_rcf_stage2.2/saved_eval_export_crf
This should reproduce around 83% mIoU (J-FrameMean).
This repo also supports training AMD. However, this implementation is not guaranteed to be identical to the original one. In our experience, it reproduces results slightly better than the originally reported results without test-time adaptation (i.e., fine-tuning on the downstream data). To set up the training set (YouTube-VOS), simply unzip `train_all_frames.zip` from YouTube-VOS to `data/youtube-vos/train_all_frames`; see the sketch below. The setup for the validation sets is the same as for RCF.
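A minimal sketch of that setup, assuming the archive contains a top-level `train_all_frames/` directory as in the official YouTube-VOS release:

```sh
# Unzip the official YouTube-VOS training frames into the expected path.
mkdir -p data/youtube-vos
unzip train_all_frames.zip -d data/youtube-vos/
ls data/youtube-vos/train_all_frames
```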
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/amd/amd.yaml
If you have any questions about the paper or this implementation, please contact Long Lian via the email address in the paper.
Please give us a star 🌟 on GitHub to support us!
Please cite our work if you find it inspiring or use our code in your research:
@InProceedings{Lian_2023_CVPR,
author = {Lian, Long and Wu, Zhirong and Yu, Stella X.},
title = {Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {14582-14591}
}
@article{lian2022improving,
title={Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy},
author={Lian, Long and Wu, Zhirong and Yu, Stella X},
journal={arXiv preprint arXiv:2212.08816},
year={2022}
}