Joint detection and tracking model named DEFT, or ``Detection Embeddings for Tracking." Our approach relies on an appearance-based object matching network jointly-learned with an underlying object detection network. An LSTM is also added to capture motion constraints.
Could you provide model inference example (on frames series or video)?
Best regards