qsisi opened this issue 2 months ago
Hi @qsisi ,
I am not sure if I understand the question correctly. For all the datasets, we train the tracker with a frame number in range(3, 31). That means we randomly sample 3-31 frames from a scene and feed them directly to the tracker. This fits well on an 80 GB A100 GPU (with bf16), with the number of tracks set to 512.
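For illustration, the per-batch frame sampling could look roughly like the sketch below (a hypothetical helper, not our actual dataloader; `scene_frames` is assumed to hold all frames of one scene):

```python
import random
import torch

def sample_training_clip(scene_frames, min_len=3, max_len=31):
    # scene_frames: (T, 3, H, W) tensor with all frames of one scene.
    # Illustration of the 3-31 frame sampling described above only,
    # not the actual VGGSfM dataloader.
    num_frames = random.randint(min_len, min(max_len, scene_frames.shape[0]))
    indices = torch.randperm(scene_frames.shape[0])[:num_frames]
    return scene_frames[indices]
```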
Thanks for your prompt reply. So for every batch in training, the video length is a random number ∈[3,31], for kubric/MegaDepth/Co3D, right?
Also, in my trials, the tracker performance varies between your previously released model vggsfm_v102.bin and the current model vggsfm_v2_0_0.bin, and it seems the previous model is better than the current one :) May I ask what happened here? Were the tracker parts of these two models trained with different strategies?
It is the same for kubric/MegaDepth/Co3D. However, it should be noted that in the original kubric dataset, the videos may only have 24 frames, so it should be [3,24].
Could you elaborate on how you evaluated the performance of the two trackers? In our internal experiments, the new one should be more accurate and generalize better.
Thanks for your reply.
```python
from vggsfm.models.track_modules.base_track_predictor import BaseTrackerPredictor
from vggsfm.models.track_modules.blocks import BasicEncoder
from gluefactory.models.extractors.superpoint_open import SuperPoint
from gluefactory.models.extractors.sift import SIFT
import torch.nn.functional as F
from omegaconf import DictConfig, OmegaConf
import hydra
import torch
import numpy as np
import cv2 as cv
from collections import Counter
import matplotlib.pyplot as plt
import copy


def get_query_points(superpoint, sift, query_image, max_query_num=4096):
    pred_sp = superpoint({"image": query_image})["keypoints"]
    pred_sift = sift({"image": query_image})["keypoints"]
    query_points = torch.cat([pred_sp, pred_sift], dim=1)
    query_points = pred_sift  # note: only the SIFT keypoints are actually used here
    if query_points.shape[1] > max_query_num:
        random_point_indices = torch.randperm(query_points.shape[1])[:max_query_num]
        query_points = query_points[:, random_point_indices, :]
    return query_points


@hydra.main(config_path="cfgs/", config_name="demo")
def main(cfg: DictConfig):
    OmegaConf.set_struct(cfg, False)

    # Build the coarse feature encoder and tracker twice, once per checkpoint
    fnet_v1 = BasicEncoder(cfg=cfg).cuda()
    tracker_v1 = BaseTrackerPredictor(cfg=cfg).cuda()
    fnet_v2 = BasicEncoder(cfg=cfg).cuda()
    tracker_v2 = BaseTrackerPredictor(cfg=cfg).cuda()

    ckpt_v1 = torch.load("/data/vggsfm/vggsfm_v102.bin")
    ckpt_v2 = torch.load("/data/vggsfm_v2/vggsfm/vggsfm_v2_0_0.bin")

    # Load the coarse fnet / coarse predictor weights from the v1.0.2 checkpoint
    fnet_dict_from_ckpt_v1 = {key.replace("track_predictor.coarse_fnet.", ""): val for key, val in ckpt_v1.items() if "track_predictor.coarse_fnet" in key}
    assert fnet_dict_from_ckpt_v1.keys() == fnet_v1.state_dict().keys()
    fnet_v1.load_state_dict(fnet_dict_from_ckpt_v1, strict=True)

    tracker_dict_from_ckpt_v1 = {key.replace("track_predictor.coarse_predictor.", ""): val for key, val in ckpt_v1.items() if "track_predictor.coarse_predictor" in key}
    assert tracker_dict_from_ckpt_v1.keys() == tracker_v1.state_dict().keys()
    tracker_v1.load_state_dict(tracker_dict_from_ckpt_v1, strict=True)

    # Same for the v2.0.0 checkpoint
    fnet_dict_from_ckpt_v2 = {key.replace("track_predictor.coarse_fnet.", ""): val for key, val in ckpt_v2.items() if "track_predictor.coarse_fnet" in key}
    assert fnet_dict_from_ckpt_v2.keys() == fnet_v2.state_dict().keys()
    fnet_v2.load_state_dict(fnet_dict_from_ckpt_v2, strict=True)

    tracker_dict_from_ckpt_v2 = {key.replace("track_predictor.coarse_predictor.", ""): val for key, val in ckpt_v2.items() if "track_predictor.coarse_predictor" in key}
    assert tracker_dict_from_ckpt_v2.keys() == tracker_v2.state_dict().keys()
    tracker_v2.load_state_dict(tracker_dict_from_ckpt_v2, strict=True)

    superpoint = SuperPoint({"nms_radius": 4, "force_num_keypoints": True}).cuda().eval()
    sift = SIFT({}).cuda().eval()
    fnet_v1.eval()
    tracker_v1.eval()
    fnet_v2.eval()
    tracker_v2.eval()

    # Load 8 consecutive frames of the South-Building scene
    imgs = []
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180141.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180142.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180143.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180144.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180145.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180146.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180147.JPG"), (1920, 1080))[None, ...])
    imgs.append(cv.resize(cv.imread("/data/South-Building/images/P1180148.JPG"), (1920, 1080))[None, ...])
    imgs_numpy = np.concatenate(imgs)
    video = torch.from_numpy(imgs_numpy).unsqueeze(0).permute(0, 1, 4, 2, 3).float().cuda() / 255.

    # Detect query points on the first frame, then rescale everything to 512x384
    query_points = get_query_points(superpoint, sift, video[0, 0:1, ...]).cuda()
    original_h, original_w = 1080, 1920
    new_h, new_w = 384, 512
    stride = 4.
    video = F.interpolate(video, size=(3, new_h, new_w))
    query_points[:, :, 0] *= new_w / original_w
    query_points[:, :, 1] *= new_h / original_h

    ## version 1
    with torch.no_grad():
        fmaps = fnet_v1(video.flatten(0, 1)).unsqueeze(0)
        coord_pred_list, vis_pred = tracker_v1(query_points / stride, fmaps)
        coord_pred = coord_pred_list[-1]

    # visualize the tracks that are predicted visible in every frame
    imgs_to_show = copy.deepcopy(imgs)
    imgs_to_show = [cv.resize(img[0], (new_w, new_h)) for img in imgs_to_show]
    for i in range(coord_pred.shape[2]):
        if (vis_pred[0, :, i] > 0.05).sum().item() == len(imgs):
            for j in range(len(imgs_to_show)):
                if vis_pred[0, j, i].item() > 0.05:
                    cv.circle(imgs_to_show[j], (int(coord_pred[0, j, i, 0]), int(coord_pred[0, j, i, 1])), 3, (0, 0, 255), -1)
    cat_img = cv.hconcat(imgs_to_show)
    cat_img_left, cat_img_right = cat_img[:, :cat_img.shape[1] // 2], cat_img[:, cat_img.shape[1] // 2:]
    cat_img = cv.vconcat([cat_img_left, cat_img_right])
    cv.imwrite("track_full_video_version_1.png", cat_img)

    # count how many frames each track is visible in
    track_length = [(vis_pred[0, :, i] > 0.05).sum().item() for i in range(vis_pred.shape[2])]
    track_length = np.array(track_length)
    ct = Counter(track_length)
    track_count_version_1 = [ct[i] for i in range(1, len(imgs) + 1)]

    ## version 2
    with torch.no_grad():
        fmaps = fnet_v2(video.flatten(0, 1)).unsqueeze(0)
        coord_pred_list, vis_pred = tracker_v2(query_points / stride, fmaps)
        coord_pred = coord_pred_list[-1]

    # visualize
    imgs_to_show = copy.deepcopy(imgs)
    imgs_to_show = [cv.resize(img[0], (new_w, new_h)) for img in imgs_to_show]
    for i in range(coord_pred.shape[2]):
        if (vis_pred[0, :, i] > 0.05).sum().item() == len(imgs):
            for j in range(len(imgs_to_show)):
                if vis_pred[0, j, i].item() > 0.05:
                    cv.circle(imgs_to_show[j], (int(coord_pred[0, j, i, 0]), int(coord_pred[0, j, i, 1])), 3, (0, 0, 255), -1)
    cat_img = cv.hconcat(imgs_to_show)
    cat_img_left, cat_img_right = cat_img[:, :cat_img.shape[1] // 2], cat_img[:, cat_img.shape[1] // 2:]
    cat_img = cv.vconcat([cat_img_left, cat_img_right])
    cv.imwrite("track_full_video_version_2.png", cat_img)

    track_length = [(vis_pred[0, :, i] > 0.05).sum().item() for i in range(vis_pred.shape[2])]
    track_length = np.array(track_length)
    ct = Counter(track_length)
    track_count_version_2 = [ct[i] for i in range(1, len(imgs) + 1)]

    # plot the number of tracks per track length for both checkpoints
    x = [i for i in range(1, len(imgs) + 1)]
    plt.plot(x, track_count_version_1, color='b', label='vgg_v1')
    plt.plot(x, track_count_version_2, color='r', label='vgg_v2')
    plt.xlabel("Track Length")
    plt.ylabel("Count")
    plt.legend()
    plt.savefig("vgg_v1&v2_comparison.png")


if __name__ == "__main__":
    main()
```
For images P1180141 ~ P1180148, the track-count curve:
![vgg_v1 v2_comparison](https://github.com/facebookresearch/vggsfm/assets/44374058/31ddf121-20bd-4ef5-b652-d32c8dc82203)
For P1180181 ~ P1180188:
![vgg_v1 v2_comparison](https://github.com/facebookresearch/vggsfm/assets/44374058/f95f75d3-19f0-4276-b986-8c3ab45b0739)
For P1180201 ~ P1180208:
![vgg_v1 v2_comparison](https://github.com/facebookresearch/vggsfm/assets/44374058/90e4b7fd-588a-4899-9fda-13423976d5dc)
For other image inputs, the difference between v2.0 and v1.0 is small.
But in the above cases, it seems the v2.0 model does not completely outperform the v1.0 model.
So would you mind sharing some information about the "internal experiments" comparing these two trackers? It would definitely help a lot.
Looking forward to your reply.
Hi,
Thanks for your reply.
Hi @qsisi ,
By the way, among the checkpoints v1.0, v1.1, and v1.2, which one works best for you? I can also put it into the README in case someone else needs it. I will also run another comparison between the v1.x and v2.0 checkpoints.
@jytime
Hi @qsisi ,
I have updated the dataset file for MegaDepth.
@jytime
Thanks for your help! Also, when will you upload the "scene_info.npz" files for MegaDepth, as well as the implementation of the "rawcamera_to_track" function? That would be very helpful for understanding the code.
Hi,
The "rawcamera_to_track" has been uploaded to https://github.com/facebookresearch/vggsfm/blob/dirty_train/dirty_train/dataset_util.py
Regarding scene_info.npz, these files were generated by previous works (such as this). If I remember correctly, they are downloaded automatically when you download MegaDepth via glue-factory, or, if you prefer, you can generate them with your own processing as guided here: https://github.com/mihaidusmanu/d2-net
Thanks for your update.
Thanks again for your update and patience!
So you set return_track=True and select the tracks with the most visibilities using top-k, i.e., nearly no non-visible tracks are sampled during training?
We never sample tracks that are invisible over all the frames, but it is possible that tracks are invisible in some of the frames, e.g., a track may be visible in 5 frames and invisible in 2 frames.
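Conceptually, the selection works roughly like the sketch below (a simplified illustration of the idea, not the exact dataset code; `visibility` is assumed to be a per-frame, per-track boolean mask):

```python
import torch

def select_training_tracks(visibility, num_tracks=512):
    # visibility: (S, N) boolean mask, frame x track.
    # Simplified illustration: never keep tracks that are invisible in all
    # frames, then take the top-k tracks by number of visible frames.
    vis_count = visibility.sum(dim=0)                            # (N,)
    candidate_idx = torch.nonzero(vis_count > 0, as_tuple=True)[0]
    k = min(num_tracks, candidate_idx.numel())
    topk = torch.topk(vis_count[candidate_idx], k).indices
    return candidate_idx[topk]                                   # indices of selected tracks
```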
Have you found that the inaccurate "visibility" flags described above harm the training as supervision signals? Or should we just sample tracks that are visible in all frames to avoid the inaccurate "visibility" flags? Thanks for your advice!
No. Such inaccurate "visibility" flags are not a problem for our training.
Thanks for your reply!
I trained a vanilla tracker on MegaDepth, but its performance is far from VGGSfM's :(
So I was wondering whether you could provide the data configuration yaml (whether to do cropping, RandomErasing, etc.) for MegaDepth :) It looks like the vggsfm_v5.yaml in the 'dirty_train' branch is configured for Co3D.
Thanks for your help!
```python
node_num = 1
gpu_num = 8

accelerate_args = {
    "num_machines": node_num,
    "multi_gpu": True,
    "num_processes": gpu_num * node_num,  # 8 gpus requested
    "num_cpu_threads_per_process": 12,
}

hydra_config = "../cfgs/vggsfm_v5.yaml"
base_conf = OmegaConf.load(hydra_config)

# Common params
base_conf.seed = 100866

grid_param = {
    "load_camera": [False],
    "adapad": [False],
    "pre_factor": [2],
    "mixed_precision": ["bf16"],
    "train.img_size": [1024],
    "rot_aug": [True],
    "inside_shuffle": [True],
    "MODEL.ENCODER.stride": [4],
    "clip_trackL": [512],
    "train.mixset": ["m"],  # YOU CAN ALSO SET "km", which uses kubric and megadepth
    "train.erase_aug": [True],
    "repeat_mix": [1],
    "batch_size": [4],  # "batch_size": [4, 2,],
    "train.track_num": [512],
    "dynamix": [True],
    "train.max_images": [64],
    "train.lr": [0.0001],
    "train.len_train": [4096],
}
```
Please see this config, which trains the tracker on MegaDepth with 8 GPUs (the v2 checkpoint was trained with 8 GPUs). It basically inherits the default config vggsfm_v5.yaml and uses grid_param to override some flags. We use the same config for training on Kubric, or you can directly combine Kubric and MegaDepth for training. If you still find training hard, try freezing the fine tracker at the beginning of training; with only the coarse tracker you should already be able to achieve results that look good to the human eye.
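For freezing, a rough sketch could be something like the snippet below (hypothetical parameter-name substrings; adapt them to the actual module names in the model, and `model` is assumed to be the assembled VGGSfM model):

```python
import torch

# Freeze the fine tracker so only the coarse tracker is optimized at the
# beginning of training. "fine_fnet" / "fine_predictor" are assumed name
# substrings; check your model's named_parameters() for the real ones.
for name, param in model.named_parameters():
    if "fine_fnet" in name or "fine_predictor" in name:
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```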
Hi @wzds2015 , I’m not sure if I understand your question correctly. Our inference process, as demonstrated in our demo.py file, supports images of any shape. During inference, images are padded to a square and resized to a fixed resolution.
You can check our Hugging Face demo at this link for a demonstration. It works quite well on images with different shapes.
Hi Jianyuan, thanks for the response. I am reading the source code. The current source code in the GitHub repo hasn't used the padding you mentioned, right? I saw that the dataloader forces crop_longest. If that's not the case, can you provide some code pointers in the GitHub repo? I want to make sure I use the inverted camera poses correctly for my downstream work. Basically I want to mimic the COLMAP reconstruction outputs and work on other tasks.
Ah, I see what you meant. The crop function can actually also serve as a padding function.
Yes. For example, if you have an image of size (1080, 720) and use crop_longest, the bbox will have a size of (1080, 1080), which effectively pads zeros along the shorter side.
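A small illustration of the idea (not the repo's exact crop_longest implementation):

```python
import numpy as np

def pad_to_square_via_crop(image):
    # For a (1080, 720, 3) image, a "crop" with a 1080 x 1080 bbox reaches
    # outside the original image, so the missing pixels stay zero --
    # effectively zero-padding the shorter side to make a square.
    h, w = image.shape[:2]
    side = max(h, w)
    out = np.zeros((side, side, image.shape[2]), dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = image
    return out
```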
Many thanks, Jianyuan. Is it okay to connect on WeChat? My email: wz927@nyu.edu
Hi @jytime
https://github.com/facebookresearch/vggsfm/blob/dirty_train/dirty_train/megadepthV2.py#L85
Why are these scenes filtered out?
Hi @wzds2015 mine is JianyuanJay
Hi @qsisi it is because some MegaDepth scenes (1) are used as the validation set and (2) may have bad quality. We just follow common practice, e.g., see here: https://github.com/cvg/glue-factory/tree/main/gluefactory/datasets/megadepth_scene_lists
Sorry to bother you again; it looks like you are using a mixed dataset (Kubric and MegaDepth). Could you kindly release the implementation of the Kubric dataset?
Thanks for your help!
Hey, I uploaded the code for imc, re10k, and kubric.
Thanks for your reply.
1. Do you have any plans to release the training code of the tracker? That would resolve most of the questions here. 2. Following the [metric computation](https://github.com/facebookresearch/co-tracker/blob/9ed05317b794cd177674e681321780614a65e073/cotracker/evaluation/core/evaluator.py#L35) of CoTracker, I test the performance between vgg_v1 and vgg_v2 on [tap_vid_davis_first](https://github.com/facebookresearch/co-tracker/blob/main/cotracker/datasets/tap_vid_datasets.py#L136) , here are the results: v1.0: ![image](https://private-user-images.githubusercontent.com/44374058/347295544-e1e6562c-ceb2-41b0-ad64-cddec0fc1306.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMxNzUxOTksIm5iZiI6MTcyMzE3NDg5OSwicGF0aCI6Ii80NDM3NDA1OC8zNDcyOTU1NDQtZTFlNjU2MmMtY2ViMi00MWIwLWFkNjQtY2RkZWMwZmMxMzA2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MDklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODA5VDAzNDEzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTkyZmNjZTk2NDQ4YzMyODNkMWExYWRlNDVkMzQ5ZmE3OGMxMWUyZmIxOWU5YjA2NmVhY2MyY2U4Nzc5MDZlNTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.J0QqY7I_UJbErAz3sfgAd7yV-N8jW0VtX_A5JiwklGY) v2.0: ![image](https://private-user-images.githubusercontent.com/44374058/347295600-2d9262fd-f4f0-4a8e-ab0b-6e9d906c0f6f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMxNzUxOTksIm5iZiI6MTcyMzE3NDg5OSwicGF0aCI6Ii80NDM3NDA1OC8zNDcyOTU2MDAtMmQ5MjYyZmQtZjRmMC00YThlLWFiMGItNmU5ZDkwNmMwZjZmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MDklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODA5VDAzNDEzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk0MjBjZjJjOGRkYjJiMGY0ZmZmMTNmZDcwNzc3YjI0Yzk5NjhjNWYyNGNhNzI2NDkwYTU2Y2JkMDQxNGU3YmYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.pYfQwFlcqEfwWDoZnIXM5sr9rNtr-bX5GdEwKZJPPVQ) and it seems like the v1.0 outperforms the v2.0 in terms of 'occlusion_accuracy' and 'average_jaccard'. I agree that v2.0 may have a stronger generalization ability than v1.0 in your "internal experiments". I'm just curious about what just happed in the training during these two trackers. As I said, looking forward to the release of training scripts, which would help a lot with the above questions.
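For reference, the two metrics I am referring to have roughly the form below (a simplified sketch; the numbers above were produced with CoTracker's evaluator linked above, not with this snippet):

```python
import numpy as np

def tapvid_style_metrics(pred_tracks, gt_tracks, pred_visible, gt_visible):
    # pred_tracks, gt_tracks: (S, N, 2) point positions.
    # pred_visible, gt_visible: (S, N) boolean visibility.
    # Simplified sketch of occlusion accuracy and the averaged
    # points-within-threshold metric used in TAP-Vid-style evaluation.
    occlusion_accuracy = (pred_visible == gt_visible).mean()
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)
    pts_within = [(err[gt_visible] < thr).mean() for thr in (1, 2, 4, 8, 16)]
    return {
        "occlusion_accuracy": float(occlusion_accuracy),
        "average_pts_within_thresh": float(np.mean(pts_within)),
    }
```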
Hi @qsisi Would you please share your code on how to evaluate both trackers? I need to compare this tracker with CoTracker in my research, and your code would be a huge help. Thank you in advance!
https://github.com/facebookresearch/vggsfm/issues/21#issuecomment-2214074932 Here's how I tested them, and only the coarse part of the tracker is included.
Hello! @jytime I have several questions about the multi-stage training, specifically about the tracker.
In my understanding, you trained the tracker on Kubric first, then fine-tuned it on Co3D or MegaDepth depending on the test dataset. CoTracker trains its model on Kubric with sliding_window=8 and a total of 24 frames per sequence, as stated in its paper:
So did VGGSfM train its tracker with sliding_window=24 on Kubric (which might cause huge GPU memory consumption)? Also, what about MegaDepth and Co3D?
Looking forward to the release of the training code :)