CVMI-Lab / PLA

(CVPR 2023) PLA: Language-Driven Open-Vocabulary 3D Scene Understanding & (CVPR 2024) RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
Apache License 2.0

Problems about the codes corresponding to Eq(9) #49

Closed. QiueY514 closed this issue 1 day ago

QiueY514 commented 1 week ago

Thanks for your previous answers! When I try to understand the code alongside the equations in the paper (PLA), a few questions come up:

As to https://github.com/CVMI-Lab/PLA/blob/3a7103a4211f6eb1f6d5c518f6cc870c26b96c52/pcseg/models/head/caption_head.py#L116-L154

What is the meaning of select_image_corr in caption_info? As described in the paper (View-Level Point-Caption Association section), the RGB image v is back-projected into 3D space using the depth information d to obtain its corresponding point set, but I cannot find this back-projection step in the code. Also, how are the corresponding view images or cropped image regions selected for a given scene?

I would be very grateful if you could reply.

Dingry commented 1 week ago

Hi,

We have implemented an alternative method for finding 3D-2D correspondences: instead of back-projecting the RGB-D frames, we project the 3D points into 2D space and use the depth images to identify which 3D points correspond to each image.

The corresponding view images are named and organized as per the raw ScanNet data. To obtain point sets corresponding to a cropped image region, first establish point-pixel correspondences, then use index masks to select the appropriate points.
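
For anyone else reading this thread, here is a minimal numpy sketch of that projection-plus-depth-filtering idea. It is not the repository's exact implementation (the actual code is linked further down in this thread); the function name, the depth tolerance, and the camera-to-world pose convention are assumptions made for illustration.

```python
import numpy as np


def project_points_to_image(points, depth_img, intrinsic, pose, depth_tol=0.05):
    """Project 3D points (world frame) into one RGB-D frame and keep the points
    whose projected depth agrees with the depth image, i.e. the visible ones.

    points:    (N, 3) xyz coordinates in the world frame
    depth_img: (H, W) depth map in meters (0 = invalid)
    intrinsic: (3, 3) camera intrinsic matrix
    pose:      (4, 4) camera-to-world extrinsic matrix
    Returns the indices of visible points and their (u, v) pixel coordinates.
    """
    # world -> camera
    world_to_cam = np.linalg.inv(pose)
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]

    # keep only points in front of the camera before dividing by depth
    idx = np.where(pts_cam[:, 2] > 1e-6)[0]
    pts_cam = pts_cam[idx]

    # camera -> pixel
    uvw = (intrinsic @ pts_cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # keep only points that land inside the image bounds
    h, w = depth_img.shape
    in_img = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx, u, v, pts_cam = idx[in_img], u[in_img], v[in_img], pts_cam[in_img]

    # depth filtering: the projected depth must match the sensor depth,
    # otherwise the point is occluded in this view
    sensor_depth = depth_img[v, u]
    visible = (sensor_depth > 0) & (np.abs(pts_cam[:, 2] - sensor_depth) < depth_tol)
    return idx[visible], np.stack([u[visible], v[visible]], axis=1)
```

Given the returned (u, v) pixels, the point set for a cropped image region can then be selected with a boolean index mask over the pixels that fall inside the crop's bounding box.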

QiueY514 commented 1 week ago

> The corresponding view images are named and organized as per the raw ScanNet data. To obtain point sets corresponding to a cropped image region, first establish point-pixel correspondences, then use index masks to select the appropriate points.

Thanks for your reply! So I wonder whether you plan to release the 3D-2D correspondence approach? Is it an offline method that is not run during the training process?

Dingry commented 1 week ago

The projection code can be found here: https://github.com/CVMI-Lab/PLA/blob/main/pcseg/datasets/scannet/scannet_dataset.py#L220.

QiueY514 commented 6 days ago

> The projection code can be found here: https://github.com/CVMI-Lab/PLA/blob/main/pcseg/datasets/scannet/scannet_dataset.py#L220.

Thank you for your reply! However, I can only find the function named project_point_to_image, which seems to be used for projecting 3D points (the point cloud) onto the 2D image. Moreover, when generating the view caption indices with generate_caption_idx.py, I find that the released view caption indices from caption_idx/scannetv2_view_vit-gpt2_matching_idx.pickle do not match the generated ones. For example, for scene0000_00 the released view caption index is tensor([ 2917, 2918, 2919, ..., 25861, 25863, 25882], dtype=torch.int32), while the generated view caption index is tensor([ 2925, 2926, 2927, ..., 25849, 25854, 25855], dtype=torch.int32).

Looking forward to your reply.

Dingry commented 5 days ago

This function finds the point-pixel correspondence. What other functions do you need?

The released pkl file was generated by back-projecting the 2D images into 3D points and finding correspondences through a nearest-neighbour (NN) search. However, we later implemented a more efficient version that projects the 3D points onto the 2D images and uses the depth images as a filter to find correspondences. The two methods lead to different caption indices, but in our empirical study their performance in training an open-world learner is similar.
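
For contrast with the projection sketch earlier in the thread, below is a rough sketch of the back-projection + NN-search route, i.e. an illustrative guess at the kind of offline procedure behind the released pickle rather than the actual script; the helper name, the distance threshold, and the camera-to-world pose convention are all assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree


def backproject_and_match(depth_img, intrinsic, pose, scene_points, max_dist=0.05):
    """Back-project one depth image into 3D and match the resulting points to
    the reconstructed scene point cloud with a nearest-neighbour search.

    depth_img:    (H, W) depth map in meters (0 = invalid)
    intrinsic:    (3, 3) camera intrinsic matrix
    pose:         (4, 4) camera-to-world extrinsic matrix
    scene_points: (N, 3) scene point cloud in the world frame
    Returns the indices of scene points covered by this view.
    """
    h, w = depth_img.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth_img.reshape(-1)
    valid = z > 0

    # pixel -> camera: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=1)[valid]

    # camera -> world
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pose @ pts_h.T).T[:, :3]

    # for each back-projected pixel, find its nearest scene point and keep it
    # if it lies within max_dist
    tree = cKDTree(scene_points)
    dist, idx = tree.query(pts_world, k=1)
    return np.unique(idx[dist < max_dist])
```

Because the NN threshold and the depth tolerance discretize the correspondence differently, the two routes naturally produce slightly different caption index sets, which is consistent with the mismatch reported above.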

QiueY514 commented 5 days ago

> This function finds the point-pixel correspondence. What other functions do you need?
>
> The released pkl file was generated by back-projecting the 2D images into 3D points and finding correspondences through a nearest-neighbour (NN) search. However, we later implemented a more efficient version that projects the 3D points onto the 2D images and uses the depth images as a filter to find correspondences. The two methods lead to different caption indices, but in our empirical study their performance in training an open-world learner is similar.

I'm sorry for the confusion I caused earlier, and thank you again for your patient reply!