To model the temporal aspect of the project, we can implement a CNN that feeds a sequence of video frame representations into an LSTM (a type of recurrent network that handles sequential data well). I didn't dig deep enough to find who originated the idea, but there are plenty of examples of it, like paper 1 - video analysis (medical) and paper 2 - bidirectional model (linguistics).
Some pertinent questions are whether we want to treat each frame as a time-step, or combine multiple frames into one time-step. The latter could be achieved by something like downsampling, or by running the CNN on multiple frames stacked as channels. There are further design choices too, like whether we want a bidirectional LSTM (considers info "before" and "after" a time-step) or not (just keeps track of the context of past time-steps). Lastly I'll just mention that this paradigm usually uses a pretrained CNN, and I'm assuming we won't have meaningful pretraining.
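To make the frame-as-time-step idea concrete, here is a minimal PyTorch sketch (all layer sizes and dimensions are hypothetical placeholders, not a proposed final architecture): the CNN encodes each frame independently by folding the time axis into the batch axis, and the LSTM then consumes the resulting sequence of frame embeddings. The `bidirectional` flag toggles the design choice discussed above.

```python
import torch
import torch.nn as nn


class CNNLSTM(nn.Module):
    """Per-frame CNN encoder followed by an LSTM over the frame sequence."""

    def __init__(self, embed_dim=64, hidden_dim=128, num_classes=10,
                 bidirectional=False):
        super().__init__()
        # Tiny frame encoder; a real model would be deeper (or pretrained).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, embed_dim),
        )
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=bidirectional)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, num_classes)

    def forward(self, video):
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.view(b * t, c, h, w)      # fold time into batch
        feats = self.cnn(frames).view(b, t, -1)  # (batch, time, embed_dim)
        out, _ = self.lstm(feats)                # (batch, time, out_dim)
        return self.head(out[:, -1])             # classify from last step


# 2 clips of 8 RGB 32x32 frames -> logits of shape (2, num_classes)
model = CNNLSTM(bidirectional=True)
logits = model(torch.randn(2, 8, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Combining multiple frames into one time-step would mainly change the `Conv2d` input channels (e.g. stacking k frames gives `3 * k` channels) and the reshape in `forward`.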