boheumd / MA-LMM

(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
https://boheumd.github.io/MA-LMM/
MIT License

How to download raw videos from LVU dataset? #31

Open · huaiyi66 opened this issue 3 months ago

huaiyi66 commented 3 months ago

Hello, thank you for your excellent work.

When I tried to download the dataset from the official LVU link, I found that it does not provide the raw videos, and many of the YouTube links are no longer available. How can I download the raw videos of the LVU dataset?

I would appreciate it if you could provide the LVU dataset or a download method.

YingYellow commented 3 months ago

+1

boheumd commented 3 months ago

Hi, I used youtube-dl to download the LVU dataset, and some of the videos were unavailable. I have uploaded my downloaded videos to Google Drive, and you can get the LVU raw videos through this link.
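
For anyone reproducing the download, here is a minimal sketch using youtube-dl's Python API. The video IDs and output template are placeholders, not the actual LVU file list; substitute the IDs from the LVU annotation files:

```python
import youtube_dl  # pip install youtube-dl

# Placeholder IDs -- replace with the YouTube IDs from the LVU annotations.
video_ids = ["abc123", "def456"]
urls = [f"https://www.youtube.com/watch?v={vid}" for vid in video_ids]

ydl_opts = {
    "format": "best",
    "outtmpl": "lvu_videos/%(id)s.%(ext)s",  # save each video as <id>.<ext>
    "ignoreerrors": True,  # skip videos that are private or removed
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)
```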

YingYellow commented 3 months ago

Thank you for your reply. Did you use youtube-dl to download the COIN dataset as well? I found that some of those videos are also unavailable.

boheumd commented 3 months ago

> Thank you for your reply. Did you use youtube-dl to download the COIN dataset as well? I found that some of those videos are also unavailable.

Yes, I also used youtube-dl to download the COIN dataset, and only around 10,500 videos were available.

jchsun1 commented 2 months ago

Thanks for your excellent work. I have encountered some problems and hope to get your help.

  1. There are two compression methods (frame-level and token-level) in your paper. How is the frame-based compression implemented? How do you calculate the similarity of two frames that each contain multiple tokens? By averaging the similarities of all tokens between adjacent frames?
  2. Temporal ordering information is injected into the frame-level features by a position embedding layer in the paper, but I found that it does not take effect because the weights are set to 0 (blip2_vicuna_instruct.py, lines 113 and 114).
boheumd commented 2 months ago

> Thanks for your excellent work. I have encountered some problems and hope to get your help.
>
>   1. There are two compression methods (frame-level and token-level) in your paper. How is the frame-based compression implemented? How do you calculate the similarity of two frames that each contain multiple tokens? By averaging the similarities of all tokens between adjacent frames?
>   2. Temporal ordering information is injected into the frame-level features by a position embedding layer in the paper, but I found that it does not take effect because the weights are set to 0 (blip2_vicuna_instruct.py, lines 113 and 114).
  1. The frame-based compression is not included in the published code. The idea is to first flatten each frame's token features into a single vector and then compute the cosine similarity between two adjacent frames; see the sketch below.
  2. The temporal embedding weights are only initialized to 0; they get updated through model training (see the second sketch below).
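
A minimal PyTorch sketch of that frame-level idea, assuming a memory bank of shape (num_frames, num_tokens, dim). The merge strategy here (average the most similar adjacent pair) mirrors the token-level compression in the paper, but this is not the released implementation:

```python
import torch
import torch.nn.functional as F

def compress_frame_bank(bank: torch.Tensor) -> torch.Tensor:
    """Shrink a (T, N, D) memory bank by one frame: find the most
    similar adjacent frame pair and replace it with its average."""
    T, N, D = bank.shape
    flat = bank.reshape(T, N * D)  # flatten all token features per frame
    sim = F.cosine_similarity(flat[:-1], flat[1:], dim=-1)  # (T-1,) adjacent pairs
    i = int(sim.argmax())  # index of the most redundant pair
    merged = (bank[i] + bank[i + 1]) / 2
    return torch.cat([bank[:i], merged.unsqueeze(0), bank[i + 2:]], dim=0)
```

And a sketch of the zero-initialization pattern for the temporal embedding, roughly what lines 113 and 114 of blip2_vicuna_instruct.py do (names and shapes here are illustrative, not the actual code):

```python
import torch
import torch.nn as nn

num_frames, dim = 20, 768  # illustrative sizes
temporal_pos_embed = nn.Parameter(torch.zeros(1, num_frames, dim))

# At initialization this adds nothing, but the parameter is trainable
# and picks up temporal ordering information as training progresses.
frame_feats = torch.randn(1, num_frames, dim)
frame_feats = frame_feats + temporal_pos_embed
```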
jchsun1 commented 2 months ago

Thanks for your reply. I ran the model successfully, but the temporal embedding weights still do not take effect. Could you tell me how I can get the pre-trained temporal embedding weights?