JaywongWang / DenseVideoCaptioning

Official Tensorflow Implementation of the paper "Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning" in CVPR 2018, with code, model and prediction results.

Does this code work for a general video? test.py requires an anchor file for any video, and the anchor file needs temporal segments. So how do we use this code to caption some other video? #18

Closed. fascinet closed this issue 5 years ago.

JaywongWang commented 5 years ago

Our work is based on SST, which requires pre-defined anchors. Therefore, the model can only produce temporal segments that are no longer than the pre-defined anchors. You can still directly use the model to caption videos of any duration, but the performance is not guaranteed.

fascinet commented 5 years ago

To caption a video, we need to make a JSON file containing the video id, duration, and temporal segments, e.g. `{"v_uqiMw7tQ1Cc": {"duration": 55.15, "timestamps": [[0, 4.14], [4.14, 33.36], [33.36, 55.15]], "sentences": [" "]}}`. The model will then caption these particular timestamps, but how can we define these timestamps for a general video? They are needed to find the anchors, which in turn are required by test.py. So does this code basically work only for the ActivityNet dataset?

JaywongWang commented 5 years ago

The dense captioning model is able to generate temporal proposals (segments that possibly cover an event) and to describe these proposals. At inference, the timestamps are predicted by the model. You can directly use our provided anchors.

JaywongWang commented 5 years ago

The JSON file is used for training the dense captioning model. At inference time, you don't need that.

fascinet commented 5 years ago

Which part of your code predicts the timestamps for a video? My idea is to use your test.py with the pre-trained model to caption my own video, but test.py needs three files: the feature file, the idx file, and the anchor file [Screenshot (129)]. The problem is generating this anchor file: get_anchors.py needs timestamps to generate anchors.txt, so how do I get those timestamps?

JaywongWang commented 5 years ago
  1. You need to first extract features for a video (if it is not from ActivityNet dataset).
  2. You need to get the frame rate of each input video.
  3. You can use my provided anchors.txt for captioning another video. However, since SST uses pre-defined anchors, the current largest anchor is about 200 seconds, meaning you can hardly predict action proposals longer than 200 seconds (see the sketch below).
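
For reference, a minimal sanity-check sketch (not part of the repository; the unit of the anchor values and the 16-frames-per-feature stride are assumptions) that estimates the longest event the provided anchors.txt can cover for a given frame rate:

```python
# Hypothetical sanity check, not part of this repo.
# Assumption: anchors.txt lists one anchor length per line, in feature time steps.
def longest_coverable_event_seconds(anchors_path, fps, frames_per_feature=16):
    """Estimate the longest event (in seconds) the pre-defined anchors can cover."""
    with open(anchors_path) as f:
        anchor_steps = [float(line.split()[0]) for line in f if line.strip()]
    seconds_per_step = frames_per_feature / fps
    return max(anchor_steps) * seconds_per_step

# Example: print(longest_coverable_event_seconds('anchors.txt', fps=30.0))
# According to the comment above, this should come out around 200 seconds.
```
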
JaywongWang commented 5 years ago

The anchors are associated with a pre-trained model. You do not need to generate new anchors. If the input video has totally different characteristics (duration, frame rate, etc.), I suggest re-training the model.

fascinet commented 5 years ago

My question is: even after retraining the model, do we still need timestamps for each video? How can we generate these timestamps for an arbitrary video?

train_data = json.load(open('../../%s.json' % 'train'))
video_ids = open(os.path.join(out_proposal_source, 'train', 'ids.txt')).readlines()
video_ids = [video_id.strip() for video_id in video_ids]
feature_lengths = dict()
proposal_lengths = []
for video_id in video_ids:
    data = train_data[video_id]
    timestamps = data['timestamps']

{"My video": {"duration": 55.15, "timestamps": [[0, 4.14], [4.14, 33.36], [33.36, 55.15]], "sentences": " "} how you have provided the timestamps for the video and how we generate this timestamps for my video? Is there any code for that?

JaywongWang commented 5 years ago

Is the following code written by you? (If not, please point out where I can find it.)

train_data = json.load(open('../../%s.json' % 'train'))
video_ids = open(os.path.join(out_proposal_source, 'train', 'ids.txt')).readlines()
video_ids = [video_id.strip() for video_id in video_ids]
feature_lengths = dict()
proposal_lengths = []
for video_id in video_ids:
    data = train_data[video_id]
    timestamps = data['timestamps']

JaywongWang commented 5 years ago

My point is: when training a model, we need annotated timestamps for learning the proposal module. When testing on a new video, the timestamps are predicted by the trained model. As for the anchors, after running dataset/ActivityNet_Captions/preprocess/anchors/get_anchors.py (they are pre-determined, even before training the model), the generated anchors are used in both the training and testing phases.
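
For illustration, here is a minimal sketch of how anchor lengths could be derived from the annotated training timestamps, continuing the snippet quoted above. It is a hypothetical reconstruction, not the actual contents of get_anchors.py, which may for instance cluster the lengths rather than take percentiles:

```python
import numpy as np

def derive_anchor_lengths(train_data, num_anchors=32):
    """Pick representative proposal lengths from annotated event timestamps."""
    proposal_lengths = []
    for video_id, data in train_data.items():
        for start, end in data['timestamps']:
            proposal_lengths.append(end - start)
    # Hypothetical choice: evenly spaced percentiles of the observed lengths.
    return np.percentile(proposal_lengths, np.linspace(0, 100, num_anchors)).tolist()
```
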

JaywongWang commented 5 years ago

Feel free to ask. And please point out if there is any misunderstanding.

fascinet commented 5 years ago

@JaywongWang Can you explain what the values in anchors.txt are and how I can use it to predict captions for other videos? As I understand it, anchors.txt is generated from the timestamps provided in the JSON file of the training videos (the ActivityNet dataset), so how can it be used for other videos? And if it can't, how do I generate an anchors file for my videos without providing timestamps?

JaywongWang commented 5 years ago

@fascinet anchors.txt contains the possible segment lengths for events. If you have no annotated timestamps, the only way is to use my provided anchors.txt. If you change anchors.txt in any way, you have to fine-tune or re-train the model.
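
To make the role of anchors.txt concrete, here is a hypothetical SST-style decoding sketch (not the repository's actual test.py logic): at each time step the model scores every anchor length, and a proposal can only have one of the lengths listed in anchors.txt, which is why events longer than the largest anchor cannot be predicted. `scores`, `anchor_lengths`, and `step_seconds` are assumed inputs.

```python
def decode_proposals(scores, anchor_lengths, step_seconds, threshold=0.5):
    """Turn per-step anchor confidences into (start_sec, end_sec, score) proposals.

    scores: array of shape (num_steps, num_anchors); scores[t, k] is the
    confidence that a proposal of length anchor_lengths[k] ends at step t.
    """
    proposals = []
    num_steps, num_anchors = scores.shape
    for t in range(num_steps):
        for k in range(num_anchors):
            if scores[t, k] < threshold:
                continue
            end = (t + 1) * step_seconds
            start = max(0.0, end - anchor_lengths[k] * step_seconds)
            proposals.append((start, end, float(scores[t, k])))
    # Events longer than max(anchor_lengths) can never be produced: the
    # ~200-second limitation discussed earlier in this thread.
    return sorted(proposals, key=lambda p: -p[2])
```
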

fascinet commented 5 years ago

@JaywongWang So if I use the provided anchors.txt, that means that for some other video I am restricting the segment lengths of the events, which might not hold for some videos.

JaywongWang commented 5 years ago

Yes. That is one of the main weaknesses of anchor-based approaches.

fascinet commented 5 years ago

Thank you for your answers.