DEFT ("Detection Embeddings for Tracking") is a joint detection and tracking model. The approach relies on an appearance-based object matching network learned jointly with an underlying object detection network; an LSTM is added on top to capture motion constraints.
The paper reports an evaluation on the nuScenes validation front-camera videos, broken down by two difficulty factors: occlusion and inter-frame displacement. Were the nuScenes visibility tokens used to determine whether an instance was occluded? If so, which visibility levels counted as occluded? Could the script for this evaluation be shared?
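For reference, here is a minimal sketch of how an occlusion split could be derived from the nuScenes visibility tokens. The token values and their bin meanings come from the nuScenes annotation schema; the cutoff used below (less than 60% visible counts as occluded) is an assumption for illustration, and is exactly the detail being asked about:

```python
# nuScenes labels each box annotation with a visibility token "1".."4",
# giving the fraction of the object visible across all six cameras:
#   "1" -> 0-40%, "2" -> 40-60%, "3" -> 60-80%, "4" -> 80-100%
VISIBILITY_BINS = {"1": "v0-40", "2": "v40-60", "3": "v60-80", "4": "v80-100"}


def is_occluded(visibility_token: str, occluded_tokens=("1", "2")) -> bool:
    """Classify an annotation as occluded when its visibility bin is in
    `occluded_tokens` (here: < 60% visible -- an assumed cutoff, not
    necessarily the one used in the paper)."""
    if visibility_token not in VISIBILITY_BINS:
        raise ValueError(f"unknown visibility token: {visibility_token!r}")
    return visibility_token in occluded_tokens


# With the nuscenes-devkit, the token would come from a sample annotation
# record, e.g.:
#   ann = nusc.get('sample_annotation', ann_token)
#   occluded = is_occluded(ann['visibility_token'])
```

Knowing whether the split used this kind of threshold (and at which bin) would make the difficulty-factor numbers reproducible.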