Darshansingh11 / AVLectures

Official repository of the paper "Unsupervised Audio-Visual Lecture Segmentation", WACV 2023
Creative Commons Zero v1.0 Universal

Format of the avlectures_train.pkl and avlectures_helper.pkl #3

Closed tejasrjain closed 9 months ago

tejasrjain commented 1 year ago

I am trying to reproduce the results. Can you please let me know how the input should be provided?

Darshansingh11 commented 1 year ago

Hi, thank you for your interest in our work. Please refer to the paper for the exact details. I will release the entire pipeline by the end of this month.

tejasrjain commented 1 year ago

Hello, thanks for your reply and for the great work. I tried to use the details specified in the paper, but I couldn't figure out how to split the videos into 10-15 second clips, since we also need to map them to the OCR text. How do you make sure that the frame from which the OCR text is taken is also included in the frames used to extract the video features?

Darshansingh11 commented 10 months ago

Hi. Sorry for the late response. I have updated the readme with instructions on how to train and evaluate the model. Please check it out. You can split any lecture of your choice into 10s-15s clips using ffmpeg and follow the instructions in the readme to extract the features. Then, bind all the features into a single pickle file (similar to dataset_v1_helper.pkl, which I will upload shortly; I am facing some issues while uploading large files). I am attaching a screenshot to help visualize the data format. We obtain the OCR from the last frame of each 10-15s clip.
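
For reference, a minimal sketch of how such a pickle could be assembled. The record structure and field names (`clip_id`, `video_feat`, `ocr_text`) are illustrative assumptions, not the exact schema of dataset_v1_helper.pkl; the screenshot above is the authoritative format:

```python
import pickle
import numpy as np

# Hypothetical per-clip records; the real schema is whatever
# dataset_v1_helper.pkl uses -- these field names are illustrative only.
clips = []
for clip_idx in range(3):  # e.g., three 10s-15s clips of one lecture
    clips.append({
        "clip_id": f"lecture01_clip{clip_idx:04d}",          # assumed naming scheme
        "video_feat": np.zeros(2048, dtype=np.float32),      # placeholder visual feature
        "ocr_text": "placeholder OCR of the clip's last frame",
    })

# Bind all clip features into a single pickle file.
with open("avlectures_train_sketch.pkl", "wb") as f:
    pickle.dump(clips, f)
```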

haiahaiah commented 10 months ago

Hi, thanks for the update. I'd like to know your specific process for dividing the video into 10-15 second segments. As I understand it, each video first has topic boundary labels, and then the video is divided into 10-15 second segments. In this process, it seems necessary to ensure that a topic boundary falls at the end of some segment. If so, would that involve label leakage?

Darshansingh11 commented 10 months ago

Hi, we divide a single lecture video into 10s-15s clips. This is independent of the boundary labels (we are doing segmentation in an unsupervised manner and do not rely on ground truth labels). A 10s-15s clip is the atomic unit that we operate on: we just divide a video into N segments of 10s-15s each using ffmpeg. It is true that some boundaries can fall inside a 10s-15s clip; this is captured by the BS@K metric. We have therefore also experimented with shorter clips (4s-8s), where the chance of a boundary falling mid-clip is lower. Please check the supplementary section of our paper for more details.
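
As a concrete illustration (not necessarily the authors' exact command), one way to cut fixed-length clips is ffmpeg's segment muxer, invoked here from Python. The filenames and the 12s duration (within the 10s-15s range) are assumptions:

```python
import subprocess

# A minimal sketch: split lecture.mp4 into ~12s clips using ffmpeg's
# segment muxer. With "-c copy" cuts snap to the nearest keyframe, so
# clip durations vary slightly; drop "-c copy" (re-encode) for exact cuts.
subprocess.run([
    "ffmpeg", "-i", "lecture.mp4",
    "-f", "segment",
    "-segment_time", "12",     # target clip length in seconds
    "-reset_timestamps", "1",  # each clip starts at t=0
    "-c", "copy",
    "clip_%04d.mp4",
], check=True)
```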

Darshansingh11 commented 9 months ago

Hi. I am closing this issue. All the files are uploaded. In case you have any further doubts/queries please raise a new issue. Thanks!