Epic-Kitchens (we used RGB frames)
Ego4D (we used the FHO subset)
Epic:
Download hand crops for Epic-Kitchens from the following repo.
We preprocess the provided crops by taking the union of all objects in contact with hands and all visible hands. We keep the default parameters of the respective library.
Save the result as a pickle in the format: dict[segment_id][frame_idx] = (left, top, right, bottom)
(Without pre-extracting and preprocessing, running the library on the fly is too slow.)
Save the file with the name: hand_thd0.8_obj_thd0.01_ONLY_inter_obj_with_HANDS_v2
Download hand-crop detections for Ego4D here and apply similar preprocessing: https://github.com/Chuhanxx/helping_hand_for_egocentric_videos
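The exact preprocessing script is not included here; the following is a minimal sketch of the union-and-pickle step (for either dataset), assuming per-frame hand boxes and hand-contacted object boxes are already extracted as (left, top, right, bottom) tuples. All variable and helper names are hypothetical.

import pickle

def union_box(boxes):
    # Union of a list of (left, top, right, bottom) boxes.
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))

def build_crop_pickle(detections, out_path="hand_thd0.8_obj_thd0.01_ONLY_inter_obj_with_HANDS_v2"):
    # detections: dict[segment_id][frame_idx] -> {"hands": [...], "objects_in_contact": [...]}
    crops = {}
    for segment_id, frames in detections.items():
        crops[segment_id] = {}
        for frame_idx, det in frames.items():
            boxes = det["hands"] + det["objects_in_contact"]
            if boxes:  # keep only frames with at least one detection
                crops[segment_id][frame_idx] = union_box(boxes)
    with open(out_path, "wb") as f:
        pickle.dump(crops, f)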
All splits of shared and unique (novel) noun and verb classes are in the anno/ folder.
cd x-mic/Dassl.pytorch
# Install dependencies
pip install -r requirements.txt
# Install this library (no need to re-build if the source code is modified)
python setup.py develop
This step can also be skipped.
Epic config: extract_EPIC_clip_vitb16_segments.yaml
To change:
DATASET.ROOT - where your dataset is located, with the structure DATASET.ROOT/annotations and DATASET.ROOT/epic_kitchens_videos_256ss
OUTPUT_DIR - output directory for the extracted features
Ego4D config: extract_EGO4D_clip_vitb16.yaml
To change:
DATA.PATH_TO_DATA_DIR - path to annotations
DATA.PATH_PREFIX - path to videos
DATASET.ROOT - path to videos (same as DATA.PATH_PREFIX)
OUTPUT_DIR - output directory for the extracted features
Epic config: extract_EPIC_clip_vitb16_segments_handcrops.yaml
To change: same parameters as the full-frame config above, plus
DATASET.DETECTION_ROOT - path to the hand-crop annotations (the preprocessed pickle described above)
Ego4D config: extract_EGO4D_clip_vitb16_handcrops.yaml (same additional parameter)
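A hedged sketch of how the preprocessed crop pickle can be consumed, e.g. to crop a frame before feature extraction; the PIL usage, file paths, and ids below are illustrative assumptions, not the extractor's actual code.

import pickle
from PIL import Image

with open("hand_thd0.8_obj_thd0.01_ONLY_inter_obj_with_HANDS_v2", "rb") as f:
    crops = pickle.load(f)  # dict[segment_id][frame_idx] = (left, top, right, bottom)

segment_id, frame_idx = "P01_01_0", 42                       # hypothetical segment/frame ids
frame = Image.open(f"frames/{segment_id}/{frame_idx}.jpg")   # hypothetical frame path
if frame_idx in crops.get(segment_id, {}):
    frame = frame.crop(crops[segment_id][frame_idx])         # PIL expects (left, upper, right, lower)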
To run the script on a subset, distributed over 8 GPUs:
export OMP_NUM_THREADS=64; export NCCL_ASYNC_ERROR_HANDLING=1; torchrun --standalone --nproc_per_node=8 --nnodes 1 feat_extractor_segments_distributed.py --config_name XX --split YY --distributed --seed 42
To run the script on a subset on a single GPU:
python feat_extractor_segments.py --config_name XX --split YY --div 0
XX - config name without the ".yaml" extension and without the folder path
YY - train or validation
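For example, a single-GPU run over the Epic training split with the full-frame config above (placeholders filled in) would be:
python feat_extractor_segments.py --config_name extract_EPIC_clip_vitb16_segments --split train --div 0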
Similarly, features can be extracted with the DINO and LaViLa models.
Config params:
DATA.PATH_TO_DATA_DIR - Ego4D dataset annotations location
DATA.PATH_PREFIX - Ego4D features that will be classified with the adapted classifier - best results with hand-cropped frames
DATA.PATH_PREFIX_DINO - Ego4D features that will be adapted - best results with hand-cropped frames
DATA.PATH_PREFIX_DINO2 - Ego4D features that will be adapted; these and the previous features are combined in the adaptation module - best results with full frames
DATALOADER.FEATURES_NAME - Epic features that will be classified with the adapted classifier - best results with hand-cropped frames
DATALOADER.FEATURES_NAME_DINO - Epic features that will be adapted - best results with hand-cropped frames
DATALOADER.FEATURES_NAME_DINO2 - Epic features that will be adapted; these and the previous features are combined in the adaptation module - best results with full frames
Note that all these features can be the same. To use the model without hand crops, set DATALOADER.USE_DINO_FEATURES2 = False.
Set the dimension of the conditioning features in DATALOADER.DINO_DIM if it differs from 512.
If only one dataset is available, disable cross-dataset evaluation by setting TEST.CROSS_DATASET.EVAL = False
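If it helps to script these edits, the minimal sketch below rewrites the relevant fields of a training config (e.g. the XMIC_vitb16.yaml introduced just below). It assumes the dotted parameter names above map to nested YAML keys; the output file name and feature names are placeholders, not the repository's actual defaults.

import yaml

with open("scripts/configs/XMIC_vitb16.yaml") as f:
    cfg = yaml.safe_load(f)

dl = cfg.setdefault("DATALOADER", {})
dl["FEATURES_NAME"] = "epic_clip_vitb16_handcrops"        # placeholder feature file name
dl["FEATURES_NAME_DINO"] = "epic_clip_vitb16_handcrops"   # can be the same features
dl["FEATURES_NAME_DINO2"] = "epic_clip_vitb16_full"       # placeholder full-frame features
dl["USE_DINO_FEATURES2"] = True                           # set False if not using hand crops

with open("scripts/configs/XMIC_vitb16_local.yaml", "w") as f:  # hypothetical local copy
    yaml.safe_dump(cfg, f)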
Train X-MIC config: XMIC_vitb16.yaml
Set up data or feature paths for one or two datasets.
XX - name of the config file located in the scripts/configs folder (a concrete example follows the command lists below)
With a single GPU:
Epic nouns:
sh scripts/baselines/epic_gpu1.sh noun XX
Epic verbs:
sh scripts/baselines/epic_gpu1.sh verb XX
Ego4D nouns:
sh scripts/baselines/ego_gpu1.sh noun XX
Ego4D verbs:
sh scripts/baselines/ego_gpu1.sh verb XX
With 8 GPUs:
Epic nouns:
sh scripts/baselines/epic_gpu8.sh noun XX
Epic verbs:
sh scripts/baselines/epic_gpu8.sh verb XX
Ego4D nouns:
sh scripts/baselines/ego_gpu8.sh noun XX
Ego4D verbs:
sh scripts/baselines/ego_gpu8.sh verb XX
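For example, training X-MIC on Epic noun classes with a single GPU and the XMIC_vitb16.yaml config above would look like:
sh scripts/baselines/epic_gpu1.sh noun XMIC_vitb16
(Whether the script expects the config name with or without the .yaml extension is an assumption; check the script itself.)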
Unfortunately, after my internship all models and data were deleted due to internal refactoring. As a result, I lost all pretrained models and parts of the code, and could not run a final verification of the released code.
Feel free to connect with me via email in case of any questions.
I sincerely apologise for any inconvenience this may cause.
If you use our work, please consider citing:
@inproceedings{kukleva2024xmic,
title={X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization},
author={Kukleva, Anna and Sener, Fadime and Remelli, Edoardo and Tekin, Bugra and Sauser, Eric and Schiele, Bernt and Ma, Shugao},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}