unsup-step-pred

Transformer-based unspervised insturction step prediction model, called StepFormer. It is built as a transformer decoder with learnable queries; the queries attend to the video and producde step slots that are likely to represent instruction steps. StepFormer is trained on a large (unlabeled) instructionla video dataset, HowTo100M, and uses narration text as a supervisory signal, to extract procedure steps from the video.

Setup: make sure that all the paths in paths.py point to the appropriate folders. Also, make sure to install all the necessary packages, such as pytorch, pytorch-lightning, wandb (optional, can be switched from flags) and webdataset.
Training: run python train.py to train the best performing model from the submission
Evaluation: run python test.py to get unsupervised segmentation and zero-shot step localization performance, as reported in Tables 1 and 2 of our sumbission.
Demo: open demo.ipynp

SamsungLabs / StepFormer

readme

unsup-step-pred