Yuanbo2020 / Audio-Visual-VAD

name 'regularize_frame_num' is not defined #3

NewEricWang commented 2 years ago

Another error: name 'self_attention' is not defined.

Yuanbo2020 commented 2 years ago

Hi, 'regularize_frame_num' is the number of input pictures per second; please define it according to your dataset. 'self_attention' is a boolean variable set for compatibility with another program, and its value should be False in this program.
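For anyone hitting the same NameError, a minimal sketch of the two definitions implied by this reply (the value 3 is only an example; it depends on your dataset):

```python
# Hypothetical definitions based on the author's reply above.
regularize_frame_num = 3  # number of input pictures per second; set this for your dataset
self_attention = False    # compatibility flag for another program; keep False here
```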

NewEricWang commented 2 years ago

Thanks for your quick reply! The definition of "load_files" is also missing. Could you describe the function of "load_files"? What are its inputs and outputs, and what are their formats?

Yuanbo2020 commented 2 years ago

Hi there,

Since load_files is related to the data platform of the company where I intern, I have no right to upload it; sorry about that.

Regarding the input format: the model has an audio branch and a video branch. Take N video clips, each 1 second long, as an example, and assume that the extracted acoustic features have 45 time frames per second, the video frame rate is 3 fps, and the size of each picture is 450*300 (height, width).

Then the input format of the audio branch is (3N, 15, 64), that is, (batch size, time frames, frequency bins), where the 3 comes from 45/15 = 3. The input format of the video branch is (3N, 450, 300), where the 3 comes from the 3 fps frame rate.

That is, the acoustic features (15, 64) of 1/3 s and the corresponding single picture (450, 300) are fed into the model each time, so the time resolution of the model is 1/3 s.
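As a rough illustration of this pairing (a sketch, not the author's code; the function name and variables are hypothetical):

```python
import numpy as np

def pair_audio_video_1s(log_mel, frames, fps=3, audio_frames_per_sec=45):
    """Split one second of data into fps aligned (audio, image) pairs.

    log_mel: (45, 64) log-mel spectrogram for one second of audio
    frames:  (3, 450, 300) grayscale frames for the same second
    """
    step = audio_frames_per_sec // fps             # 45 / 3 = 15 time frames per image
    audio_chunks = log_mel.reshape(fps, step, -1)  # (3, 15, 64)
    return audio_chunks, frames                    # each (15, 64) chunk pairs with one (450, 300) image

# Stacking N such one-second clips along the batch axis yields the
# audio input (3N, 15, 64) and the video input (3N, 450, 300).
a, v = pair_audio_video_1s(np.random.rand(45, 64), np.random.rand(3, 450, 300))
print(a.shape, v.shape)  # (3, 15, 64) (3, 450, 300)
```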

Hope this information is useful to you.

zhengx18 commented 2 years ago

Hello Yuanbo, I looked at the data processing you described above, and I want to ask a question: the images are grayscale pictures, not 3-channel, right?

Yuanbo2020 commented 2 years ago

@zhengx18 Yes, the 3 here is not the number of channels; it is determined by the frame rate of the video clips. The reason for this is that the server has limited computing resources and the amount of video data is huge.

zhengx18 commented 2 years ago

Hi Yuanbo, there is another question about the function 'load_data_by_each_frame'. As you said, the video frame rate is 3 fps; how did you cut the frames and give them frame-level labels? In other words, if I evenly cut a 1-second video into 3 frames and assume the time points corresponding to the three frames are 0, 0.5, and 1.0, how do you give them a label, since the label time resolution is 0.001 s? Can you share your method? @Yuanbo2020

zhengx18 commented 2 years ago

By the way, the function "load_files" returns four variables: train_x_audio, train_y_audio, train_y, and training_images_path_list. As I understand it, train_x_audio is N frame samples of 1/3 s audio log-mel spectrograms, train_y_audio is the label of those N frame samples with only audio annotations, train_y is the label with both audio and visual annotations, and training_images_path_list holds the N samples' jpg filenames. Is that all right?

As I understand it, train_x_audio has shape (N, 15, 64), and training_images_path_list has shape (N, 1), which becomes (N, 450, 300) after reading with cv2. But what are the shapes of train_y and train_y_audio? I guess they are both (N, 4). Is that right? @Yuanbo2020
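To make the interface being asked about concrete, here is a hypothetical stand-in for the private load_files that returns dummy data with the shapes guessed above (all names and values are placeholders):

```python
import numpy as np

def load_files(n_samples=100, n_classes=4):
    """Hypothetical stub: dummy data with the shapes discussed in this thread."""
    train_x_audio = np.random.rand(n_samples, 15, 64)  # 1/3 s log-mel chunks, (N, 15, 64)
    one_hot = np.eye(n_classes)
    train_y_audio = one_hot[np.random.randint(n_classes, size=n_samples)]  # (N, 4), audio-only labels
    train_y = one_hot[np.random.randint(n_classes, size=n_samples)]        # (N, 4), audio-visual labels
    training_images_path_list = [f"frame_{i}.jpg" for i in range(n_samples)]  # read with cv2 -> (N, 450, 300)
    return train_x_audio, train_y_audio, train_y, training_images_path_list
```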

Yuanbo2020 commented 2 years ago

> Hi Yuanbo, there is another question about the function 'load_data_by_each_frame'. As you said, the video frame rate is 3 fps; how did you cut the frames and give them frame-level labels? In other words, if I evenly cut a 1-second video into 3 frames and assume the time points corresponding to the three frames are 0, 0.5, and 1.0, how do you give them a label, since the label time resolution is 0.001 s? Can you share your method? @Yuanbo2020

Suppose that when we look at the video, the anchor is singing from 0 to 1.5 seconds and not singing from 1.5 to 3 seconds. Then all the pictures in 0 to 1.5 seconds are labeled vocalizing, and all the pictures in 1.5 to 3 seconds are labeled non-vocalizing.
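A minimal sketch of this rule (segment times and label names are illustrative): each frame simply takes the label of the segment its timestamp falls in.

```python
def frame_labels(frame_times, segments):
    """Assign each frame the label of the segment containing its timestamp.

    frame_times: timestamps in seconds, e.g. [0, 1/3, 2/3, ...] at 3 fps
    segments: (start, end, label) tuples from the annotation
    """
    labels = []
    for t in frame_times:
        for start, end, label in segments:
            if start <= t < end:
                labels.append(label)
                break
    return labels

# The 3-second example above at 3 fps:
times = [i / 3 for i in range(9)]
print(frame_labels(times, [(0.0, 1.5, "vocalizing"), (1.5, 3.0, "non-vocalizing")]))
# ['vocalizing'] * 5 followed by ['non-vocalizing'] * 4
```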

Yuanbo2020 commented 2 years ago

> By the way, the function "load_files" returns four variables: train_x_audio, train_y_audio, train_y, and training_images_path_list. As I understand it, train_x_audio is N frame samples of 1/3 s audio log-mel spectrograms, train_y_audio is the label of those N frame samples with only audio annotations, train_y is the label with both audio and visual annotations, and training_images_path_list holds the N samples' jpg filenames. Is that all right?
>
> As I understand it, train_x_audio has shape (N, 15, 64), and training_images_path_list has shape (N, 1), which becomes (N, 450, 300) after reading with cv2. But what are the shapes of train_y and train_y_audio? I guess they are both (N, 4). Is that right? @Yuanbo2020

Yes, your guess is correct. As you said, train_y_audio is the label corresponding to the audio branch. Since the audio branch here contains 4 output sub-branches, assuming the current result is singing, the one-hot vector for train_y_audio is [0, 0, 1, 0]. During training, in order to make each audio sub-branch learn its own target representation as independently as possible, train_y_audio is split into [0], [0], [1], [0]. This is exactly what lines 83 to 86 of the source code do: https://github.com/Yuanbo2020/Audio-Visual-VAD/blob/main/Code/framework/keras_data_generator.py
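In code terms, the split looks roughly like this (a sketch, not the exact source; see the linked keras_data_generator.py for the real implementation):

```python
import numpy as np

train_y_audio = np.array([[0, 0, 1, 0]])  # one-hot label for one "singing" sample

# Split the 4-class one-hot label into four independent binary targets,
# one per audio output sub-branch.
sub_targets = [train_y_audio[:, i:i + 1] for i in range(train_y_audio.shape[1])]
print([t.tolist() for t in sub_targets])  # [[[0]], [[0]], [[1]], [[0]]]
```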

As for train_y, the operation is the same as for train_y_audio; it also splits the 4-class task into 4 separate 2-class tasks, as in lines 88 to 91 here: https://github.com/Yuanbo2020/Audio-Visual-VAD/blob/main/Code/framework/keras_data_generator.py