Open Deep-I opened 4 years ago
Take CRNN as an example. The input to the model has shape 28x3x224x224, where 28 is the number of frames extracted from a video, 3 is the number of channels, and 224x224 is the size of each frame after resizing from the original video. The target for this input is a single label. As explained in the README, the CNN (encoder) takes this input, generates an encoding (feature vector) for each frame, and passes it to the RNN (decoder), which takes the temporal structure of the video into account.
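To make the shapes concrete, here is a minimal PyTorch sketch of that data flow. It is not the repository's model: `TinyCRNN`, its layer sizes, the 5-class output, and the leading batch dimension of 1 are all assumptions chosen only to show how a 28x3x224x224 clip can end up as one prediction per video.

```python
import torch
import torch.nn as nn


class TinyCRNN(nn.Module):
    """Hypothetical stand-in for a CNN encoder + RNN decoder (not the repo's code)."""

    def __init__(self, feat_dim=32, hidden=16, num_classes=5):
        super().__init__()
        # CNN encoder: maps one frame (3, H, W) to a feature vector of size feat_dim.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, feat_dim),
        )
        # RNN decoder: reads the per-frame features in temporal order.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.reshape(b * t, c, h, w))  # (b * t, feat_dim)
        feats = feats.reshape(b, t, -1)              # (b, t, feat_dim)
        out, _ = self.rnn(feats)                     # (b, t, hidden)
        # Use the last time step to produce one set of logits per video.
        return self.fc(out[:, -1])                   # (b, num_classes)


model = TinyCRNN()
video = torch.randn(1, 28, 3, 224, 224)  # one video: 28 frames of 3x224x224
logits = model(video)
print(logits.shape)
```

In this sketch the model emits a single logit vector for the whole clip, which matches the one-label-per-video target described above.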
The dataset consists of videos, and each video has one class (target). Each video is split into frames, and the extracted frames are the input to the model. So how should a video classification model be evaluated? Should I measure accuracy per video, comparing the one label predicted for the whole video against its target? Or should I measure accuracy per frame, feeding in each frame separately and scoring each frame's prediction?
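To make the two options concrete, here is a small pure-Python sketch that computes both numbers on made-up predictions. The function names, the toy labels, and the majority-vote step for turning per-frame predictions into one label per video are assumptions for illustration only, not something taken from the repository.

```python
from collections import Counter


def video_level_accuracy(video_preds, video_labels):
    """Fraction of videos whose single predicted label matches the target."""
    correct = sum(p == y for p, y in zip(video_preds, video_labels))
    return correct / len(video_labels)


def frame_level_accuracy(frame_preds, video_labels):
    """Fraction of individual frames whose prediction matches the video's target."""
    correct = 0
    total = 0
    for preds, y in zip(frame_preds, video_labels):
        correct += sum(p == y for p in preds)
        total += len(preds)
    return correct / total


def majority_vote(frame_preds):
    """Collapse per-frame predictions into one label per video (most frequent label)."""
    return [Counter(preds).most_common(1)[0][0] for preds in frame_preds]


# Toy data: 3 videos, 4 frames each, one target label per video.
labels = [0, 1, 2]
frame_preds = [
    [0, 0, 1, 0],
    [1, 2, 2, 2],
    [2, 2, 2, 2],
]

video_preds = majority_vote(frame_preds)
video_acc = video_level_accuracy(video_preds, labels)
frame_acc = frame_level_accuracy(frame_preds, labels)
print(video_preds, video_acc, frame_acc)
```

On this toy data the two metrics happen to coincide (2 of 3 videos correct, 8 of 12 frames correct), but in general they can differ, which is why it matters which one is reported.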