This repository contains the code to replicate the Video Object Segmentation (VOS) benchmark of the VISOR dataset. It replicates the results of Table 3 in our paper: EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

Download pre-trained models

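| backbone  | training stage | training dataset | J&F  | J    | F    | weights |
|-----------|----------------|------------------|------|------|------|---------|
| resnet-50 | stage 1        | MS-COCO          | 56.9 | 55.6 | 58.3 | link    |
| resnet-50 | stage 2        | MS-COCO -> VISOR | 76.4 | 74.2 | 78.6 | link    |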




The MS-COCO instance segmentation dataset is used to generate synthetic 3-frame videos for training STM. This can be helpful as a pretraining stage before the main training on VISOR.
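The idea of turning one static image into a short pseudo-video can be sketched as follows. This is a minimal illustration only, not the augmentation used by train_coco.py: the function name `make_synthetic_clip`, the `max_shift` parameter, and the choice of a plain random translation are assumptions made for the example.

```python
import numpy as np

def make_synthetic_clip(image, mask, num_frames=3, max_shift=20, seed=None):
    """Build a pseudo-video by randomly shifting one image/mask pair.

    image: (H, W, 3) array; mask: (H, W) integer array of object ids.
    Hypothetical helper: the same shift is applied to image and mask
    so that the annotation stays aligned with the pixels.
    """
    rng = np.random.default_rng(seed)
    frames, masks = [], []
    for _ in range(num_frames):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll wraps around at the border; a real pipeline would
        # more likely pad, crop, or apply a full affine transform.
        frames.append(np.roll(image, shift=(dy, dx), axis=(0, 1)))
        masks.append(np.roll(mask, shift=(dy, dx), axis=(0, 1)))
    return np.stack(frames), np.stack(masks)
```

Each call returns arrays of shape (num_frames, H, W, 3) and (num_frames, H, W), which is the kind of 3-frame sample a video segmentation model can be trained on.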



After pretraining on MS-COCO, we fine-tune on the VISOR dataset by sampling 3 frames from a sequence in each training iteration. To visualize the VISOR dataset, you can check VISOR-VIS.
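Sampling 3 frames from a sequence can be sketched as below. The exact sampling policy of train_stm_baseline.py is not described here, so the `max_skip` limit on the gap between consecutive picks and the function name are assumptions for illustration.

```python
import random

def sample_training_frames(num_frames_in_seq, num_samples=3, max_skip=5, rng=random):
    """Pick `num_samples` strictly increasing frame indices.

    At most `max_skip` frames are skipped between consecutive picks
    (hypothetical parameter, not taken from the training script).
    """
    if num_frames_in_seq < num_samples:
        raise ValueError("sequence is shorter than the number of samples")
    # The first pick must leave room for the remaining ones.
    start = rng.randint(0, num_frames_in_seq - num_samples)
    indices = [start]
    for remaining in range(num_samples - 1, 0, -1):
        prev = indices[-1]
        # Keep at least `remaining - 1` frames available after this pick.
        upper = min(prev + 1 + max_skip, num_frames_in_seq - remaining)
        indices.append(rng.randint(prev + 1, upper))
    return indices
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which can help when debugging a data loader.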

Dataset Structure

To run the training or evaluation scripts, the dataset must be laid out as follows, following the DAVIS format. A script to convert VISOR to this DAVIS-like layout is described in the next section.

|- VISOR_2022
  |- val_data_mapping.json
  |- train_data_mapping.json
  |- JPEGImages
  |- Annotations
  |- ImageSets
     |- 2022
        |- train.txt
        |- val.txt
        |- val_unseen.txt

  |- train2017
  |- annotations
      |- instances_train2017.json

Here val.txt lists the sequences that belong to the validation split, and val_unseen.txt lists the subset of the validation split from unseen kitchens. Mapping files such as val_data_mapping.json are also generated; they record the object names and their corresponding mask color codes. These files are useful for any statistics based on object names, and they are used by Semi-Supervised Codalab if you wish to participate in the EPIC-KITCHENS VISOR Semi-Supervised Video Object Segmentation Challenge.
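Before training or evaluating, it can be handy to confirm that the layout is in place. The sketch below checks only the entries named in the VISOR_2022 tree above and reads a split file; the helper names `missing_entries` and `read_split` are hypothetical and are not part of this repository.

```python
from pathlib import Path

# Entries taken from the VISOR_2022 tree shown above.
EXPECTED_ENTRIES = [
    "val_data_mapping.json",
    "train_data_mapping.json",
    "JPEGImages",
    "Annotations",
    "ImageSets/2022/train.txt",
    "ImageSets/2022/val.txt",
    "ImageSets/2022/val_unseen.txt",
]

def missing_entries(visor_root):
    """Return the expected files or directories that do not exist."""
    root = Path(visor_root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]

def read_split(visor_root, split="val", year="2022"):
    """Return the sequence names listed in ImageSets/<year>/<split>.txt."""
    path = Path(visor_root) / "ImageSets" / year / f"{split}.txt"
    with open(path) as f:
        # One sequence name per line; blank lines are ignored.
        return [line.strip() for line in f if line.strip()]
```

An empty list from `missing_entries("../VISOR_2022")` suggests the conversion produced the expected top-level structure; it does not validate the contents of the files.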

VISOR to DAVIS-like format

To generate the required structure, first download the VISOR train/val images and json files. Then run the visor_to_davis.py script with the following parameters:

set: train or val, the split for which you want to generate the DAVIS-like dataset.
keep_first_frame_masks_only: 0 or 1. Set to 0 to keep all masks for each sequence, or to 1 to keep only the masks in the first frame. This flag is usually 1 when generating val and 0 when generating train.
visor_jsons_root: path to the VISOR json files. The train and val folders should exist under this root directory as follows:

|- visor_jsons_root
   |- train
      |- P01_01.json
      |- PXX_(X)XX.json
   |- val
      |- P01_107.json
      |- PXX_(X)XX.json

images_root: path to the RGB images root directory. The images should be in the following structure:

|- images_root
   |- P01_01
      |- P01_01_frame_xxxxxxxxxx.jpg
   |- PXX_XXX
      |- PXX_(X)XX_frame_xxxxxxxxxx.jpg

output_directory: path to the directory where you want the converted VISOR data to be written. A VISOR_2022 directory with DAVIS-like formatting is created there automatically.
output_resolution: resolution of the output images and masks. The default is 480p, which is the resolution the VOS baseline was tested on.

Below is a sample run of the script that generates train and val at 480p resolution. You must run it twice, once for train and once for val. Note that keep_first_frame_masks_only changes between the two runs: the training split keeps all masks, whereas the validation split keeps only the first-frame masks for proper evaluation.

To generate val:
python visor_to_davis.py -set val -keep_first_frame_masks_only 1  -visor_jsons_root . -images_root ../VISOR_Images/Images_fixed -output_directory ../out_data

To generate train:
python visor_to_davis.py -set train -keep_first_frame_masks_only 0  -visor_jsons_root . -images_root ../VISOR_Images/Images_fixed -output_directory ../out_data

The script also creates the txt files required by the DAVIS-like dataset structure. In addition, it creates mapping files under output_directory that map each mask color in the images to the corresponding object name in VISOR, for any object-related analysis.
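Since the script has to be run once per split with a different keep_first_frame_masks_only value, the two invocations can be wrapped in a small driver. This is a sketch only: the helpers `build_convert_command` and `run_conversion` are hypothetical, and the flags are the ones shown in the sample commands above (output_resolution is left at its default).

```python
import subprocess

def build_convert_command(split, visor_jsons_root, images_root, output_directory):
    """Build the visor_to_davis.py command line for one split."""
    if split not in ("train", "val"):
        raise ValueError("split must be 'train' or 'val'")
    # val keeps only first-frame masks; train keeps all masks.
    keep_first = "1" if split == "val" else "0"
    return [
        "python", "visor_to_davis.py",
        "-set", split,
        "-keep_first_frame_masks_only", keep_first,
        "-visor_jsons_root", str(visor_jsons_root),
        "-images_root", str(images_root),
        "-output_directory", str(output_directory),
    ]

def run_conversion(visor_jsons_root, images_root, output_directory):
    """Run the conversion for val and then train, stopping on the first failure."""
    for split in ("val", "train"):
        cmd = build_convert_command(split, visor_jsons_root, images_root, output_directory)
        subprocess.run(cmd, check=True)
```

For example, `run_conversion(".", "../VISOR_Images/Images_fixed", "../out_data")` would issue the same two commands as the sample run, assuming it is called from the directory that contains visor_to_davis.py.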


Stage 1

To pretrain on MS-COCO, you can run the following script.

python train_coco.py -Dvisor "path to visor" -Dcoco "path to coco" -backbone "[resnet50,resnet18]" -save "path to save models"
python train_coco.py -Dvisor ../data/Davis/ -Dcoco ../data/Ms-COCO/ -backbone resnet50 -save ../coco_weights/

Stage 2

Main training on VISOR. To get the best performance, you should resume from the MS-COCO pretrained model from Stage 1.

python train_stm_baseline.py -Dvisor "path to visor" -total_iter "total number of iterations" -test_iter "test every this number of iterations" -backbone "[resnet50,resnet18]" -wandb_logs "1 if you want to save the logs into your wandb account (offline)" -save "path to save models" -resume "path to coco pretrained weights"
python train_stm_baseline.py -Dvisor  ../VISOR_2022/ -total_iter 400000 -test_iter 40000 -batch 32 -backbone resnet50 -save ../visor_weights/ -name experiment1 -wandb_logs 0  -resume ../coco_weights/coco_res50.pth


Evaluation on VISOR is based on the DAVIS evaluation code, which we adjusted to include the last frame of each sequence in the scores.

python eval.py -g "gpu id" -s "set" -y "year" -D "path to visor" -p "path to weights" -backbone "[resnet50,resnet18,resnest101]"
python eval.py -g 0 -s val -y 22 -D ../data/VISOR -p ../visor_weights/coco_lr_fix_skip_0_1_release_resnet50_400000_32_399999.pth -backbone resnet50
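For readers unfamiliar with the DAVIS metrics reported in the table of pre-trained models: J is the region similarity (intersection over union between predicted and ground-truth masks), F is a boundary accuracy measure, and J&F is the mean of the two. The sketch below illustrates J and the J&F average only; it is not the evaluation code used by eval.py, and scoring two empty masks as 1.0 is an assumption about the convention.

```python
import numpy as np

def jaccard(pred_mask, gt_mask):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j_score, f_score):
    """J&F is the plain average of the J and F scores."""
    return (j_score + f_score) / 2.0
```

As a sanity check against the stage 2 row of the table, `j_and_f(74.2, 78.6)` gives 76.4 up to floating-point rounding.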

Codalab Evaluation

If you want to participate in our EPIC-KITCHENS VISOR Semi-Supervised Video Object Segmentation Challenge, please refer to the Semi-Supervised Codalab repository, which contains all the details needed to create the submission file.


When using this repo, any of our models, or the dataset, you need to cite the VISOR paper.

Citing VISOR

We use code from the original STM implementation in the official STM repository and from the STM training repository. If you use this code, you also need to cite STM.

Citing STM

The code is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, found here.