To caption a video, we need to make a JSON file containing the video id, duration, and temporal segments, e.g. {"v_uqiMw7tQ1Cc": {"duration": 55.15, "timestamps": [[0, 4.14], [4.14, 33.36], [33.36, 55.15]], "sentences": " "}}. The model then captions these particular timestamps, but how can we define these timestamps for a general video? They are needed to find the anchors, which are in turn required by the test.py file. So does this code basically work only for the ActivityNet dataset?
The dense captioning model is able to generate temporal proposals (segments that possibly cover an event) and to describe these proposals. At inference, the timestamps are predicted by the model. You can directly use our provided anchors.
The JSON file is used for training the dense captioning model. At inference time, you don't need that.
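For reference, if you do have event annotations for your own videos, a training JSON in the same format could be put together along these lines (a minimal sketch; the entry mirrors the ActivityNet Captions structure quoted above, and the video id, captions, and output file name are placeholders, not part of the repository):

import json

# Minimal sketch: build a training-style annotation file for your own video.
# The structure mirrors the ActivityNet Captions JSON quoted above; the id,
# captions, and file name below are illustrative.
annotations = {
    "v_my_video": {
        "duration": 55.15,  # total video length in seconds
        "timestamps": [[0.0, 4.14], [4.14, 33.36], [33.36, 55.15]],  # [start, end] per event
        "sentences": [
            "First event description.",
            "Second event description.",
            "Third event description.",
        ],  # one caption per timestamp
    }
}

with open("train.json", "w") as f:
    json.dump(annotations, f)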
Which part of your code predicts the timestamps for a video? My idea is to use your test.py file to caption my video with the pre-trained model, but test.py needs three files: the feature, idx, and anchor files. The problem is generating the anchor file: get_anchors.py needs timestamps to generate anchor.txt, so how do I get those timestamps?
The anchors are associated with a pre-trained model. You do not need to generate new anchors. If the input video has totally different characteristics (duration, frame rate, etc.), I suggest re-training the model.
My question is: even after retraining the model, do we need timestamps for each video? How can we generate these timestamps for any video?
train_data = json.load(open('../../%s.json' % 'train'))
video_ids = open(os.path.join(out_proposal_source, 'train', 'ids.txt')).readlines()
video_ids = [video_id.strip() for video_id in video_ids]
feature_lengths = dict()
proposal_lengths = []
for video_id in video_ids:
    data = train_data[video_id]
    timestamps = data['timestamps']
{"My video": {"duration": 55.15, "timestamps": [[0, 4.14], [4.14, 33.36], [33.36, 55.15]], "sentences": " "} how you have provided the timestamps for the video and how we generate this timestamps for my video? Is there any code for that?
Is the following code written by you? (If not, please point out where I can find it.)
train_data = json.load(open('../../%s.json' % 'train'))
video_ids = open(os.path.join(out_proposal_source, 'train', 'ids.txt')).readlines()
video_ids = [video_id.strip() for video_id in video_ids]
feature_lengths = dict()
proposal_lengths = []
for video_id in video_ids:
    data = train_data[video_id]
    timestamps = data['timestamps']
My point is: when training a model, we need annotated timestamps for learning the proposal module. When testing on a new video, the timestamps are predicted by the trained model. As for the anchors, after running dataset/ActivityNet_Captions/preprocess/anchors/get_anchors.py (they are pre-determined, even before training the model), the generated anchors are used in both the training and testing phases.
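To make the relation between annotated timestamps and anchors concrete, here is a rough sketch of how anchors could be derived from the training annotations. This is an assumption about the general idea rather than the exact contents of get_anchors.py (which may work in feature frames and use a different clustering or binning scheme): collect the ground-truth event lengths and keep a small set of representative lengths as the anchors.

import json

import numpy as np
from sklearn.cluster import KMeans

# Sketch (not the actual get_anchors.py): derive anchors as K representative
# event lengths, obtained by clustering the ground-truth segment lengths.
train_data = json.load(open('train.json'))

lengths = []
for video_id, data in train_data.items():
    for start, end in data['timestamps']:
        lengths.append(end - start)

K = 32  # number of anchors (illustrative)
kmeans = KMeans(n_clusters=K, n_init=10).fit(np.array(lengths).reshape(-1, 1))
anchors = sorted(float(c) for c in kmeans.cluster_centers_.flatten())

with open('anchors.txt', 'w') as f:
    for anchor in anchors:
        f.write('%.4f\n' % anchor)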
Feel free to ask. And please point out if there is any misunderstanding.
@JaywongWang Can you explain what the values in anchor.txt are and how I can use it to predict captions for other videos? As per my understanding, the anchor.txt file is generated using the timestamps provided in the JSON file of the training videos (ActivityNet dataset), so how can it be used for other videos? And if it can't, how do I generate the anchor file for my videos without providing timestamps?
@fascinet anchor.txt contains possible segment lengths for events. If you have no annotated timestamps, the only way is to use my provided anchor.txt. If you change anchor.txt in any way, you have to finetune or re-train the model.
@JaywongWang So if I use the provided anchor.txt, that means for some other video I am restricting the segment lengths for the events, which might not hold for some videos.
Yes. That is one of the main weaknesses of anchor-based approaches.
Thank you for your answers.
Our work is based on SST, which requires pre-defined anchors. Therefore, the model can only produce temporal segments that are no longer than the longest pre-defined anchor. Still, you can directly use the model to caption videos of any duration, but the performance is not guaranteed.
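To illustrate why the pre-defined anchors cap the proposal length, here is a schematic sketch of SST-style proposal generation (the names, shapes, and threshold are illustrative, not taken from test.py): at every time step the model scores one candidate segment per anchor, each ending at that step and extending backwards by the anchor length.

# Schematic sketch of anchor-based (SST-style) proposal generation.
# Assumptions: `anchors` holds segment lengths from anchor.txt, expressed in
# the same time-step units as t, and scores[t][k] is the model's confidence
# that a segment of length anchors[k] ending at step t covers an event.
def generate_proposals(scores, anchors, threshold=0.5):
    proposals = []
    for t, anchor_scores in enumerate(scores):
        for k, score in enumerate(anchor_scores):
            if score >= threshold:
                start = max(0, t - anchors[k])  # a segment can never exceed its anchor length
                proposals.append((start, t, score))
    return proposals

Segments longer than the largest anchor simply cannot be produced, which is exactly the restriction discussed above.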