Closed QiueY514 closed 1 day ago
Hi,
We have implemented an alternative method for finding 3D-2D correspondences. This approach involves projecting 3D points into 2D space and using depth images to identify the image-correspondent 3D points.
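The projection-plus-depth-filtering step described above can be sketched roughly as follows. This is a minimal NumPy sketch under assumed conventions (pinhole intrinsics `K`, a 4x4 world-to-camera pose, metric depth images); the function and parameter names are illustrative, not the repo's actual API:

```python
import numpy as np

def project_points_to_image(points, K, world2cam, depth_img, depth_thresh=0.05):
    """Project 3D world points into an image and keep those whose projected
    depth agrees with the sensor depth image (i.e. visible, non-occluded
    points). points: (N, 3); K: (3, 3) intrinsics; world2cam: (4, 4) pose;
    depth_img: (H, W) metric depth. All names here are illustrative."""
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])       # homogeneous coords
    cam = (world2cam @ pts_h.T).T[:, :3]               # camera-frame coords
    in_front = cam[:, 2] > 0                           # cull points behind camera
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)    # pixel column
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)    # pixel row
    h, w = depth_img.shape
    in_img = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    visible = np.zeros(n, dtype=bool)
    idx = np.where(in_img)[0]
    # depth test: projected z must agree with the depth map at that pixel
    visible[idx] = np.abs(cam[idx, 2] - depth_img[v[idx], u[idx]]) < depth_thresh
    return visible, u, v
```

The depth comparison is what filters out occluded points: a point that projects into the image but lies behind the surface recorded in the depth map fails the test.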
The corresponding view images are named and organized as per the raw ScanNet data. To obtain point sets corresponding to a cropped image region, first establish point-pixel correspondences, then use index masks to select the appropriate points.
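Once point-pixel correspondences exist, selecting the point set for a cropped image region reduces to an index mask over the projected pixel coordinates. A hedged sketch (all names are illustrative assumptions, not the repo's actual API):

```python
import numpy as np

def points_in_crop(u, v, visible, crop_box):
    """Select indices of points whose projections fall inside a cropped
    image region. u, v: per-point pixel coordinates from the point-pixel
    correspondence step; visible: boolean visibility mask; crop_box:
    (u_min, v_min, u_max, v_max). Names are illustrative assumptions."""
    u_min, v_min, u_max, v_max = crop_box
    inside = (visible
              & (u >= u_min) & (u < u_max)    # horizontal extent of the crop
              & (v >= v_min) & (v < v_max))   # vertical extent of the crop
    return np.where(inside)[0]                # index mask into the point set
```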
Thanks for your reply! So I wonder: do you plan to release the 3D-2D correspondence approach? Is it an offline method that is not run during the training process?
the projection code can be found here: https://github.com/CVMI-Lab/PLA/blob/main/pcseg/datasets/scannet/scannet_dataset.py#L220.
Thank you for your reply! However, I can only find the function named project_point_to_image, which seems to be used for projecting 3D (point cloud) to 2D (image). Moreover, when generating view caption_ids with generate_caption_idx.py, I find that the released view caption indices from caption_idx/scannetv2_view_vit-gpt2_matching_idx.pickle are not equal to the generated indices.
For example, for scene0000_00, the released view caption index from caption_idx/scannetv2_view_vit-gpt2_matching_idx.pickle is:
tensor([ 2917, 2918, 2919, ..., 25861, 25863, 25882], dtype=torch.int32)
while the generated view caption index is:
tensor([ 2925, 2926, 2927, ..., 25849, 25854, 25855], dtype=torch.int32)
Looking forward to your reply.
This function is to find the point-pixel correspondence. What other functions do you need?
The released pkl file was generated by back-projecting 2D images into 3D points and finding correspondences through NN search. However, we later implemented a more efficient version that projects 3D points onto 2D images and uses the depth images as a filter to find correspondences. The two methods lead to different caption indices, but their performance in training an open-world learner is similar in our empirical study.
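The first strategy (back-projection plus NN search) can be sketched as follows. The function name, signature, and threshold are assumptions for illustration, not the repo's actual code, and brute-force NN is used here for self-containment where a KD-tree would be used in practice:

```python
import numpy as np

def backproject_and_match(depth_img, K, cam2world, scene_points, max_dist=0.05):
    """Sketch: back-project every valid depth pixel into a 3D world point,
    then find the nearest scene point for each back-projected point.
    Names and thresholds are illustrative assumptions."""
    h, w = depth_img.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth_img.ravel()
    valid = z > 0                                  # skip missing depth
    u, v, z = u.ravel()[valid], v.ravel()[valid], z[valid]
    # pixel -> camera coordinates via the inverse pinhole model
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    world = (cam2world @ cam.T).T[:, :3]           # camera -> world frame
    # nearest-neighbour search into the scene point cloud (brute force)
    d2 = ((world[:, None, :] - scene_points[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(1)
    dist = np.sqrt(d2[np.arange(len(nn)), nn])
    return np.unique(nn[dist < max_dist])          # matched point indices
```

The forward-projection variant avoids the NN search entirely, which is why it is cheaper: each point either lands on a depth-consistent pixel or it doesn't.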
I'm sorry for the confusion I caused earlier, and thank you again for your patient reply!
Thanks for your previous answer! When I try to understand the code alongside the equations in the paper (PLA), I have some questions:
As to https://github.com/CVMI-Lab/PLA/blob/3a7103a4211f6eb1f6d5c518f6cc870c26b96c52/pcseg/models/head/caption_head.py#L116-L154
What is the meaning of select_image_corr in caption_info? As presented in the paper (View-Level Point-Caption Association section), the RGB image v is back-projected to 3D space using the depth information d to obtain its corresponding point set, but I can't find this back-projection process in the code. Also, how are the corresponding view images or cropped image regions selected for a given scene? I would be very grateful if you could reply.