NewEricWang opened this issue 2 years ago
Hi, 'regularize_frame_num' is the number of input pictures per second; please set it according to your dataset. 'self_attention' is a boolean variable kept for compatibility with another program; its value should be False in this program.
Thanks for your quick reply! The definition of 'load_files' is missing. Could you describe what 'load_files' does? What are its inputs and outputs, and what are their formats?
Hi there,
Since load_files depends on the data platform of the company where I intern, I am not allowed to upload it; sorry about that.
Regarding the input format: the model has an audio branch and a video branch. Take N video clips of 1 second each as an example, and assume the extracted acoustic features have 45 frames per second, the video frame rate is 3 fps, and each picture is 450*300 (height, width).
Then the input to the audio branch has shape (3N, 15, 64), i.e. (batch size, time frames, frequency bins), where the factor 3 comes from 45/15 = 3. The input to the video branch has shape (3N, 450, 300), where the factor 3 comes from the 3 fps frame rate.
That is, each step feeds the model the acoustic feature (15, 64) of 1/3 s together with the corresponding single picture (450, 300), so the time resolution of the model is 1/3 s.
Hope this information is useful to you.
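The reshaping described above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the repo's actual preprocessing code; all variable names and the random data are assumptions.

```python
import numpy as np

# Hypothetical sketch: shaping N one-second clips into the audio/video
# branch inputs described above. Random data stands in for real features.
N = 4                       # number of 1-second clips
AUDIO_FRAMES_PER_SEC = 45   # acoustic feature frames per second
FPS = 3                     # video frame rate
STEP = AUDIO_FRAMES_PER_SEC // FPS  # 15 audio frames per video frame

audio = np.random.rand(N, AUDIO_FRAMES_PER_SEC, 64)  # (N, 45, 64) log-mel
video = np.random.rand(N, FPS, 450, 300)             # (N, 3, 450, 300) grayscale

# Each 1/3 s of audio (15, 64) pairs with one picture (450, 300).
audio_in = audio.reshape(N * FPS, STEP, 64)   # (3N, 15, 64)
video_in = video.reshape(N * FPS, 450, 300)   # (3N, 450, 300)
print(audio_in.shape, video_in.shape)
```

With N = 4 this prints (12, 15, 64) and (12, 450, 300), matching the (3N, ...) shapes above.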
Hello Yuanbo, I looked at the data processing you mentioned above and want to ask a question: the image is a grayscale picture, not 3 channels, right?
@zhengx18 Yes, the 3 here is not the number of channels; it is determined by the frame rate of the video clips. The reason for this is that the server has limited computing resources and the amount of video data is huge.
Hi Yuanbo, I have another question, about the function 'load_data_by_each_frame'. You said the video frame rate is 3 fps; how did you cut the frames and give them frame-level labels? In other words, if I evenly cut a 1-second video into 3 frames, and we assume the time points corresponding to the three frames are 0, 0.5, and 1.0 s, how do I label them, given that the label time resolution is 0.001 s? Could you share your method? @Yuanbo2020
Suppose that in the video the anchor is singing from 0 to 1.5 seconds and not singing from 1.5 to 3 seconds. Then all the pictures in 0 to 1.5 s are labeled vocalizing, and all the pictures in 1.5 to 3 s are labeled non-vocalizing.
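The segment-to-frame labeling described above can be sketched as follows. This is a hypothetical illustration, not the repo's actual 'load_data_by_each_frame'; the segment list, the helper function, and labeling each frame by its midpoint are all assumptions.

```python
FPS = 3  # video frame rate

# (start_sec, end_sec, label) segment annotations, as in the example above:
# 0-1.5 s the anchor is singing, 1.5-3 s the anchor is not singing.
segments = [(0.0, 1.5, "vocalizing"), (1.5, 3.0, "non-vocalizing")]

def label_for_time(t, segments):
    """Return the label of the annotated segment covering time t (seconds)."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return segments[-1][2]  # clamp times at/past the end to the last segment

# A frame extracted at 3 fps covers 1/3 s; label it by its midpoint time.
frame_labels = []
for i in range(int(3.0 * FPS)):      # 9 frames over the 3-second clip
    midpoint = (i + 0.5) / FPS
    frame_labels.append(label_for_time(midpoint, segments))

print(frame_labels)
```

Since every frame falls entirely inside one annotated segment, the 0.001 s label resolution never conflicts with the 1/3 s frame resolution here: the first four frames come out vocalizing and the remaining five non-vocalizing.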
By the way, the function "load_files" returns four variables: train_x_audio, train_y_audio, train_y, and training_images_path_list. As I understand it, train_x_audio is N frame samples of 1/3 s audio log-mel spectrograms; train_y_audio is the labels of the N frame samples from audio-only annotations; train_y is the labels of the N frame samples from both audio and visual annotations; and training_images_path_list is the N frame samples' jpg filenames. Is that all right? I also understand that train_x_audio has shape (N, 15, 64), and that training_images_path_list has shape (N, 1) and becomes (N, 450, 300) after reading with cv2. But what are the shapes of train_y and train_y_audio? I guess they are both (N, 4); is that right? @Yuanbo2020
Yes, your guess is correct. As you said, train_y_audio is the label corresponding to the audio branch. Since the audio branch contains 4 output sub-branches, assuming the current result is singing, the one-hot vector for train_y_audio is [0, 0, 1, 0]. During training, to make each audio sub-branch learn its own target representation as independently as possible, train_y_audio is split into [0], [0], [1], [0]. This is exactly what lines 83 to 86 of the source code do: https://github.com/Yuanbo2020/Audio-Visual-VAD/blob/main/Code/framework/keras_data_generator.py
As for train_y, the operation is the same as for train_y_audio: it also splits one 4-class task into 4 separate 2-class tasks, as in lines 88 to 91 here: https://github.com/Yuanbo2020/Audio-Visual-VAD/blob/main/Code/framework/keras_data_generator.py
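The label-splitting idea described above can be sketched in NumPy. This is a hypothetical illustration of the technique, not a copy of the repo's keras_data_generator.py; the example labels and variable names are assumptions.

```python
import numpy as np

# Hypothetical sketch: each sample's 4-class one-hot label becomes four
# separate binary targets, one per output sub-branch of the audio model.
train_y_audio = np.array([
    [0, 0, 1, 0],   # e.g. "singing"
    [1, 0, 0, 0],   # e.g. some other class
])

# Split the (N, 4) label matrix into four (N, 1) column targets.
sub_targets = [train_y_audio[:, i:i + 1] for i in range(4)]
for t in sub_targets:
    print(t.ravel())
```

Each sub-branch then trains against its own (N, 1) column, so the sub-branches learn their targets independently, as described above.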
Another error: NameError: name 'self_attention' is not defined