Is it possible to adapt this model for video-based person re-identification? Could we attach an LSTM layer near the end to aggregate the feature tensors over all the video frames? If we then change the input layers to accept video, could we feed two video sequences into the model and expect it to output whether they show the same person or not? Would this work? It seems like a simple solution for video person re-id, yet I can't find anyone who's done it.
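To make the idea concrete, here is a minimal NumPy sketch of the aggregation step I have in mind. It is only an illustration, not a working re-id system: the weights are random and untrained, and the per-frame CNN features are stand-in random vectors rather than outputs of a real backbone. The point is just that an LSTM turns a variable-length sequence of frame features into one fixed-size clip descriptor, and two such descriptors can then be compared with a distance:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_aggregate(frames, Wx, Wh, b):
    """Run a single-layer LSTM over per-frame feature vectors and
    return the final hidden state as the clip-level descriptor."""
    d = Wh.shape[1]  # hidden size
    h = np.zeros(d)
    c = np.zeros(d)
    for x in frames:
        z = Wx @ x + Wh @ h + b      # stacked gate pre-activations: i|f|g|o
        i = sigmoid(z[:d])           # input gate
        f = sigmoid(z[d:2 * d])      # forget gate
        g = np.tanh(z[2 * d:3 * d])  # candidate cell state
        o = sigmoid(z[3 * d:])       # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

feat_dim, hidden = 8, 4  # toy sizes; a real model would be much larger
Wx = rng.normal(size=(4 * hidden, feat_dim)) * 0.1
Wh = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

# Two "clips" of per-frame features (stand-ins for real CNN outputs).
# Note the sequences can have different lengths.
clip_a = rng.normal(size=(10, feat_dim))
clip_b = rng.normal(size=(15, feat_dim))

desc_a = lstm_aggregate(clip_a, Wx, Wh, b)
desc_b = lstm_aggregate(clip_b, Wx, Wh, b)

# Both descriptors are fixed-size regardless of clip length, so a simple
# distance (or a learned similarity head) can score "same person or not".
dist = np.linalg.norm(desc_a - desc_b)
print(desc_a.shape, desc_b.shape)
```

In a trained version, the CNN and LSTM would be shared between the two branches (Siamese-style) and trained end-to-end with a verification or contrastive loss, so that small distances correspond to the same identity.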