AssafSinger94 / dino-tracker

Official PyTorch Implementation for “DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video”

Confidence score for predictions? #10

Closed: AtaAtasoy closed this issue 2 months ago

AtaAtasoy commented 3 months ago

Hi, first of all, congrats on your great work!

As far as I know, the visibility/occlusion value is inferred discretely (occluded or not) in model_inference.py with:

def compute_occ_pred_for_qp(self, green_trajectories_qp: torch.Tensor, source_trajectories_qp: torch.Tensor,
                            traj_cos_sim_qp: torch.Tensor, anch_sim_th: float, cos_sim_th: float):
  # Anchor frames whose feature similarity to the query is high enough are treated as visible.
  visible_at_st_frame_qp = traj_cos_sim_qp >= anch_sim_th  # (T,) bool
  # dists_from_source: (M x T), where M is the number of visible anchor frames;
  # dists_from_source[anchor_t, source_t] = distance between the two trajectory predictions.
  dists_from_source = torch.norm(green_trajectories_qp - source_trajectories_qp[visible_at_st_frame_qp, :].unsqueeze(1), dim=-1)

  anchor_median_errors = torch.median(dists_from_source[:, visible_at_st_frame_qp], dim=0).values  # (T_vis,)
  median_anchor_dist_th = anchor_median_errors.max()  # scalar distance threshold
  dists_from_source_anchor_vis = dists_from_source  # (M x T)
  median_dists_from_source_anchor_vis = torch.median(dists_from_source_anchor_vis, dim=0).values  # (T,)
  # Occluded where the trajectories disagree beyond the threshold or the feature similarity is low.
  return (median_dists_from_source_anchor_vis > median_anchor_dist_th) | (traj_cos_sim_qp < cos_sim_th)

Is there a way to calculate a "confidence" value, like the ones used in some trackers or detectors? The goal would be to use it as a weight for the tracker's trajectory predictions.

Would it make sense to use a value such as median_dists_from_source_anchor_vis * traj_cos_sim_qp (when the point is not occluded), or perhaps occlusion * cos_sims, where cos_sims is the result of compute_trajectory_cos_sims in model_inference.py?
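
For concreteness, here is a rough sketch of the kind of weighting I have in mind. confidence_from_occ_signals is just an illustrative helper that reuses the quantities computed in compute_occ_pred_for_qp; the exponential mapping of the distance term is my own assumption, not something from the repo.

import torch

def confidence_from_occ_signals(median_dists_from_source_anchor_vis: torch.Tensor,  # (T,)
                                traj_cos_sim_qp: torch.Tensor,                      # (T,)
                                median_anchor_dist_th: float) -> torch.Tensor:
  # Map the trajectory-alignment error to (0, 1]: zero error gives 1,
  # an error at the occlusion threshold gives exp(-1).
  dist_conf = torch.exp(-median_dists_from_source_anchor_vis / median_anchor_dist_th)
  # Clamp the cosine similarity to [0, 1] so it acts as a soft visibility weight.
  sim_conf = traj_cos_sim_qp.clamp(min=0.0)
  return dist_conf * sim_conf  # (T,) per-frame confidence in [0, 1]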

I would appreciate any guidance regarding this issue. Thank you for your time.

tnarek commented 3 months ago

Hi @AtaAtasoy, thanks for your question! Given a query point x_q and a tracked prediction x_t, there are several signals that can indicate the prediction confidence, most of which you already mentioned:

  1. The correlation of the refined features sampled at x_q and x_t.
  2. The alignment between the trajectories originating at x_q and x_t. median_dists_from_source_anchor_vis can serve for this, but you may also consider all of the frames, not just the anchor frames.
  3. The unimodality of the similarity of x_q to the features in frame t, i.e. how confidently the refined feature of x_q matches the tracked feature at x_t (a sketch combining all three signals follows below).
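
For illustration, here is a rough sketch of how these three signals could be combined into a single per-frame score. The prediction_confidence helper, the softmax temperature, and the distance scale are assumptions for the sketch, not code from this repo; signal 2 is passed in as a precomputed scalar (e.g. an entry of median_dists_from_source_anchor_vis).

import math
import torch
import torch.nn.functional as F

def prediction_confidence(feat_q: torch.Tensor,      # (C,) refined feature at the query point x_q
                          feat_map_t: torch.Tensor,  # (C, H, W) refined feature map of frame t
                          pred_xy: torch.Tensor,     # (2,) predicted location x_t, pixel coords (x, y)
                          traj_dist_t: float,        # signal 2: trajectory-alignment error at frame t
                          dist_scale: float = 8.0,   # assumed scale (pixels) for the distance term
                          temperature: float = 0.07) -> float:  # assumed softmax temperature
  C, H, W = feat_map_t.shape
  # Signal 1: cosine similarity between x_q's feature and every feature in frame t.
  sims = F.cosine_similarity(feat_q[:, None], feat_map_t.reshape(C, -1), dim=0)  # (H*W,)
  x = pred_xy[0].round().long().clamp(0, W - 1)
  y = pred_xy[1].round().long().clamp(0, H - 1)
  sim_at_pred = sims[y * W + x]  # nearest-pixel sample; bilinear sampling would be more precise
  # Signal 3: unimodality, i.e. how dominant the best match is over the whole similarity map.
  unimodality = torch.softmax(sims / temperature, dim=0).max()
  # Signal 2: map the trajectory-alignment error to (0, 1].
  dist_conf = math.exp(-traj_dist_t / dist_scale)
  return float(sim_at_pred.clamp(min=0.0) * unimodality) * dist_conf

Each factor lies in [0, 1], so the product can be used directly as a weight on the trajectory prediction.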

Let me know if you have further questions.

tnarek commented 2 months ago

@AtaAtasoy I'm closing this issue for now. If you have further questions, feel free to reopen it.