Generating captions for other videos

shamanthak-hegde commented 1 year ago

Hi, is there a script to generate captions for videos other than ActivityNet or YouCook2? Could you let me know what changes need to be made to generate captions for a video that does not have any annotations? Thanks

Kashu7100 commented 1 year ago

Thank you for your interest. To generate the caption for videos, you need to:

1. Extract features: follow the README to extract the env, agent, and lang features.
2. Prepare event proposals: since the task is video paragraph captioning (VPC), the network requires event proposals as input. If your video does not have such annotations, you can either generate them with a temporal action proposal generation (TAPG) network or set the intervals manually.
3. Decode the caption: with the features and event proposals, you should be able to run the network to generate the caption.
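For the "set the intervals manually" option in step 2, one simple choice is to cut the video into equal, back-to-back segments. The sketch below is only an illustration: the function name `uniform_proposals`, the `(start, end)`-in-seconds representation, and the two-decimal rounding are assumptions, and the thread does not say which annotation format the network expects.

```python
from typing import List, Tuple


def uniform_proposals(duration: float, num_events: int) -> List[Tuple[float, float]]:
    """Split [0, duration] into num_events equal, back-to-back intervals.

    Hypothetical helper: the (start, end) tuples are in seconds and are an
    assumed representation, not the repo's confirmed annotation format.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if num_events < 1:
        raise ValueError("num_events must be at least 1")
    step = duration / num_events
    # Derive every boundary from its index so rounding error does not accumulate.
    bounds = [round(i * step, 2) for i in range(num_events + 1)]
    # Pin the last boundary to the video end.
    bounds[-1] = round(duration, 2)
    return list(zip(bounds[:-1], bounds[1:]))


# A 60-second video split into 4 events.
print(uniform_proposals(duration=60.0, num_events=4))
# → [(0.0, 15.0), (15.0, 30.0), (30.0, 45.0), (45.0, 60.0)]
```

Uniform segments ignore the actual content of the video, so proposals from a TAPG network will usually align better with real events.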

Let me know if you have further questions!

shamanthak-hegde commented 1 year ago

Hey! Thanks for the response. I don't have the captions for the videos. Correct me if I'm wrong, but without the captions and their timestamps I won't be able to extract the language features. Can you confirm this? If so, how do I get around that? Thanks

Kashu7100 commented 1 year ago

Oh, I see your point. For the language feature extraction, if the domain is similar, you can use the vocabularies from ActivityNet or YouCook2 (and you don't need the timestamps for feature extraction). If you want to prepare a dictionary yourself (with ChatGPT or something similar), that should also work.
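If you do prepare a dictionary yourself, a minimal sketch of building a word-to-index vocabulary from a handful of domain sentences could look like this. Everything here is an assumption for illustration: the function name `build_vocab`, the special tokens, and the plain `dict` output are hypothetical, and the format that the language feature extraction in this repo expects is not stated in the thread.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical special tokens; the repo's actual vocabulary may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>"]


def build_vocab(sentences: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent lowercase word to an integer index."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Lowercase and keep runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", sentence.lower()))
    # Sort alphabetically so the indices are reproducible across runs.
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + words)}


vocab = build_vocab(["A man slices an onion.", "The man stirs the pot."])
print(len(vocab), vocab["man"])
```

The sentences can come from any source that matches your video domain, such as the existing dataset vocabularies or text you write or generate yourself.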

Fredham commented 12 months ago

Hi, thanks for sharing! I tried to follow the steps you mentioned to extract features. It seems that the files are in the "preprocess" dir, but I didn't find a file that extracts the env features, and I am confused about how to run those files. Can you provide the steps to run them and extract the features? Thanks!

Kashu7100 commented 12 months ago

Hi, @Fredham

Thanks for your interest. The env features and the agent features are based on C3D, which you can extract using the SlowFast fork (as mentioned in the Env feature extraction section). That said, you need to set up the SlowFast fork to extract the features.

git clone https://github.com/vhvkhoa/SlowFast
cd SlowFast
python setup.py build develop
python tools/run_net.py --cfg configs/Kinetics/SLOWONLY_8x8_R50.yaml --feature_extraction --num_features 100 --video_dir path/to/dir/rescaled --feat_dir path/to/data/[anet/yc2]/c3d_env TEST.CHECKPOINT_FILE_PATH models/SLOWONLY_8x8_R50.pkl NUM_GPUS 1 TEST.CHECKPOINT_TYPE caffe2 TEST.BATCH_SIZE 1 DATA.SAMPLING_RATE 1 DATA.NUM_FRAMES 16 DATA_LOADER.NUM_WORKERS 0

The command above is run in the SlowFast repo (not in this repo). Note that you might want to use preprocess/convert_to_mp4.py and/or preprocess/rescale_video.py from this repo to convert your videos to mp4 and rescale their frames.

For the agent features, you need the detectron2 fork in addition to SlowFast to get the bounding boxes of the agents.

git clone https://github.com/vhvkhoa/detectron2
python -m pip install -e detectron2
wget https://dl.fbaipublicfiles.com/detectron2/COCO-Detection/faster_rcnn_R_101_FPN_3x/137851257/model_final_f6e8b1.pkl
python tools/bbox_extract.py path/to/dir/rescaled path/to/dir/bbox --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml --sampling-rate 16 --target-frames 100 --opts MODEL.WEIGHTS model_final_f6e8b1.pkl

After you obtain the bounding boxes, you can use the SlowFast fork to extract the agent features.

# navigate to SlowFast
python tools/run_net.py --cfg configs/Kinetics/SLOWONLY_8x8_R50.yaml --feature_extraction --num_features 100 --video_dir path/to/dir/rescaled --feat_dir path/to/data/[anet/yc2]/c3d_agent MODEL.NUM_CLASSES 200 TEST.CHECKPOINT_TYPE caffe2 TEST.CHECKPOINT_FILE_PATH models/SLOWONLY_8x8_R50.pkl NUM_GPUS 1 TEST.BATCH_SIZE 1 DATA.PATH_TO_BBOX_DIR path/to/dir/bbox DETECTION.ENABLE True DETECTION.SPATIAL_SCALE_FACTOR 32 DATA.SAMPLING_RATE 1 DATA.NUM_FRAMES 16 RESNET.SPATIAL_STRIDES [[1],[2],[2],[1]] RESNET.SPATIAL_DILATIONS [[1],[1],[1],[2]] DATA.PATH_TO_TMP_DIR /tmp/agent_0/
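The env and agent commands share most of their arguments, so it can help to see them assembled from one place. The sketch below only builds and prints the command line and does not run anything. The helper name `build_extraction_cmd` is hypothetical. All flags and values are copied from the two commands above, but the key-value overrides are grouped in a different order, which is assumed not to matter.

```python
import shlex
from typing import List, Optional

# Arguments common to both commands in this thread.
BASE_ARGS = [
    "python", "tools/run_net.py",
    "--cfg", "configs/Kinetics/SLOWONLY_8x8_R50.yaml",
    "--feature_extraction",
    "--num_features", "100",
]
COMMON_OVERRIDES = [
    "TEST.CHECKPOINT_FILE_PATH", "models/SLOWONLY_8x8_R50.pkl",
    "TEST.CHECKPOINT_TYPE", "caffe2",
    "NUM_GPUS", "1",
    "TEST.BATCH_SIZE", "1",
    "DATA.SAMPLING_RATE", "1",
    "DATA.NUM_FRAMES", "16",
]


def build_extraction_cmd(video_dir: str, feat_dir: str,
                         bbox_dir: Optional[str] = None) -> List[str]:
    """Return the env command, or the agent command when bbox_dir is given."""
    cmd = BASE_ARGS + ["--video_dir", video_dir, "--feat_dir", feat_dir]
    cmd += COMMON_OVERRIDES
    if bbox_dir is None:
        # Only the env command sets the number of data-loader workers.
        cmd += ["DATA_LOADER.NUM_WORKERS", "0"]
    else:
        # Agent-only overrides: detection on bounding boxes from detectron2.
        cmd += [
            "MODEL.NUM_CLASSES", "200",
            "DATA.PATH_TO_BBOX_DIR", bbox_dir,
            "DETECTION.ENABLE", "True",
            "DETECTION.SPATIAL_SCALE_FACTOR", "32",
            "RESNET.SPATIAL_STRIDES", "[[1],[2],[2],[1]]",
            "RESNET.SPATIAL_DILATIONS", "[[1],[1],[1],[2]]",
            "DATA.PATH_TO_TMP_DIR", "/tmp/agent_0/",
        ]
    return cmd


env_cmd = build_extraction_cmd("path/to/dir/rescaled", "path/to/data/anet/c3d_env")
agent_cmd = build_extraction_cmd("path/to/dir/rescaled", "path/to/data/anet/c3d_agent",
                                 bbox_dir="path/to/dir/bbox")
# shlex.join quotes the bracketed values so a shell does not treat them as globs.
print(shlex.join(env_cmd))
print(shlex.join(agent_cmd))
```

Run the printed commands from inside the SlowFast fork, after the bounding boxes exist for the agent case.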

Hope this helps.

Fredham commented 12 months ago

Thanks a lot! You are so generous!

UARK-AICV / VLTinT

Generating captions for other videos #9