[x] Encode each video separately with the same shared CNN (+ FCNN / LSTM / Attention) network and concatenate the resulting feature vectors
[ ] After encoding with a feature detector, append a few fully connected layers and add a SINDy constraint that encourages the features to reflect the underlying physical behavior (best case: the network learns to extract the trajectories from the video)
[ ] Use algorithmic stereo reconstruction to extract the 3D trajectory from the videos and use it as the condition
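
For the stereo reconstruction item, here is a minimal sketch of one possible approach. It assumes a rectified, calibrated camera pair (parallel optical axes, identical intrinsics, horizontal baseline) and that the object has already been tracked to one pixel coordinate per frame in each view. None of this comes from the notes above: the function names and the parameters `focal_px`, `baseline_m`, and `principal_point` are hypothetical, and a real setup may need full calibration and general triangulation instead.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def triangulate_rectified(
    left: Point2D,
    right: Point2D,
    focal_px: float,
    baseline_m: float,
    principal_point: Point2D,
) -> Point3D:
    """Recover one 3D point in the left-camera frame from a rectified match.

    Assumption: both cameras share focal length and principal point, and the
    right camera is shifted by `baseline_m` along the left camera's x axis.
    """
    u_left, v_left = left
    u_right, _ = right  # in a rectified pair the row coordinate matches the left view
    c_x, c_y = principal_point

    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")

    # Depth from disparity, then back-project the left pixel to metric x and y.
    z = focal_px * baseline_m / disparity
    x = (u_left - c_x) * z / focal_px
    y = (v_left - c_y) * z / focal_px
    return (x, y, z)


def stereo_trajectory(
    left_track: Sequence[Point2D],
    right_track: Sequence[Point2D],
    focal_px: float,
    baseline_m: float,
    principal_point: Point2D,
) -> List[Point3D]:
    """Triangulate a per-frame pixel track from both views into a 3D trajectory."""
    if len(left_track) != len(right_track):
        raise ValueError("both tracks need one pixel coordinate per frame")
    return [
        triangulate_rectified(l, r, focal_px, baseline_m, principal_point)
        for l, r in zip(left_track, right_track)
    ]
```

The resulting list of `(x, y, z)` points could then be flattened and passed in as the condition, replacing or complementing the learned video features.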