OpenDriveLab / ViDAR

[CVPR 2024 Highlight] Visual Point Cloud Forecasting
https://arxiv.org/abs/2312.17655
Apache License 2.0

In Eq. 5, the BEV feature expectation function is 2D but the conditional probability is 3D; how is the final BEV feature computed? #1

Closed SPA-junghokim closed 9 months ago

SPA-junghokim commented 10 months ago

I am writing to express my sincere appreciation for your excellent research. Your work has been incredibly insightful and has sparked my curiosity in several areas.

I have a question regarding Equation 5 in your paper. Equations 3 and 4 are computed in three-dimensional space (x, y, z), so I am curious how Equation 5 is calculated. Given that the conditional probability is three-dimensional while the Bird's Eye View (BEV) feature is two-dimensional, I assume there must be a way to reduce the conditional probability to two dimensions, but I could not work out the calculation from the supplementary materials. In the section below Equation 8 there is a mention of g(i) = {x_i, y_i}, which appears to be two-dimensional. Could you please clarify how this computation is performed?

Additionally, I am finding it challenging to interpret the meaning of a statement related to Equation 6: "The ray-wise features are shared by all grids lying in the same ray." Could you kindly provide a more detailed explanation of what this implies in the context of your research?

Additionally, the loss function in Eq. 7 does not appear to be computed per group. Is the voxel occupancy calculated independently for each group?

Thank you in advance for your time and assistance in clarifying these points. Your insights will be invaluable in furthering my understanding of this subject.

tomztyang commented 10 months ago

Hi, thanks for your interest in our work.

For Q1, we compute the conditional probability in two dimensions (the x- and y-axes), since the BEV features are two-dimensional. We follow the same form as Eq. 3 and Eq. 4 but calculate it in 2D.
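
A minimal PyTorch sketch of what such a 2D ray-wise computation could look like, assuming the standard volume-rendering recurrence \hat{p}_i = p_i \prod_{j<i} (1 - p_j); the function name and tensor layout below are illustrative, not the ViDAR codebase:

```python
import torch

def conditional_probability_2d(occ_prob: torch.Tensor) -> torch.Tensor:
    """Conditional probability along 2D BEV rays (illustrative sketch).

    occ_prob: (num_rays, num_samples) occupancy probabilities for the
              sample points on each ray, ordered from the origin outwards.
    Returns p_hat_i = p_i * prod_{j<i}(1 - p_j): the probability that a
    ray reaches sample i unoccluded and terminates there.
    """
    # Transmittance up to and including each sample...
    trans = torch.cumprod(1.0 - occ_prob, dim=-1)
    # ...shifted right so sample i only accumulates samples j < i.
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    return occ_prob * trans
```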

For Q2, please see Fig. 8 (Step-c). As shown, after step-3 (the Feature Expectation Function), \hat{F} is a ray-wise feature with the same response for all grids lying on the same ray.

For Q3, the loss function is a ray-wise cross-entropy loss for each ground-truth point. It is calculated on the obtained BEV feature map.

Best, Zetong

SPA-junghokim commented 10 months ago

Thank you so much for your reply.

Regarding your answers to Q1 and Q3: is the BEV occupancy probability output also 2D for the conditional probability calculation? If so, is its loss computed against 2D ground truth, and how do you obtain a 2D occupancy ground truth from a 3D LiDAR point cloud?

Thank you. Jungho Kim

tomztyang commented 10 months ago

No, we just use the point cloud as supervision. For each ground-truth point, we cast a ray and compute a ray-wise cross-entropy loss.
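
In code, one way to read this (my own sketch; the helper name and shapes are assumptions, not the repository's API) is to treat each ray as a classification over its depth samples:

```python
import torch.nn.functional as F

def ray_wise_ce_loss(ray_logits, gt_sample_idx):
    """Ray-wise cross-entropy supervised by LiDAR points (illustrative).

    ray_logits:    (num_gt_points, num_samples) occupancy logits gathered
                   along the ray cast from the origin through each
                   ground-truth point.
    gt_sample_idx: (num_gt_points,) LongTensor, index of the ray sample
                   nearest to the ground-truth point.
    """
    # Each ray is a classification over its samples: the model should
    # place the surface at the ground-truth depth.
    return F.cross_entropy(ray_logits, gt_sample_idx)
```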

Best, Zetong

SepJourney commented 10 months ago

@tomztyang Hi, thanks for sharing this great work! I also have some questions about latent rendering.

You sample j points between the origin and the i-th BEV grid to calculate the conditional probability for the i-th grid, and Eq. 10 gives that conditional probability. Given the conditional probability \hat{p} for all BEV grids:

Q1: Why not simply multiply it by the BEV features (\hat{F}_bev = \hat{p} * F_bev) to highlight the BEV grids?

Q2: I don't understand Eq. 11. Should I first resample k points along the ray of each BEV grid (are the k points and the j points not the same?) to get the ray-wise feature \hat{F}, and then multiply it by the conditional probability? Should the k points also be located between the origin and the i-th BEV grid?

Q3: What values of k and j are used in your experiments?

tomztyang commented 10 months ago

Hi @SepJourney ,

Thanks for your insightful questions!

For Q1, we have conducted experiments on this. The performance is almost the same if we use \hat{F}_bev = \hat{p} * F_bev (47.58 vs. 47.34). Please see Table 12: compare the "conditional probability only" row with the "conditional probability with feature expectation function" row.
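
In tensor form, that variant is just an element-wise rescaling of the feature map (shapes and names here are illustrative, not the repository's code):

```python
import torch

H, W, C = 256, 256, 64          # assumed BEV resolution and channel count
p_hat = torch.rand(H, W)        # conditional probability per BEV grid
F_bev = torch.rand(H, W, C)     # BEV feature map
# The Q1 variant: scale each grid's feature by its conditional probability.
F_hat_bev = p_hat.unsqueeze(-1) * F_bev
```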

For Q2, the k points are all points lying on the same ray as the i-th BEV grid. Thus, after Eq. 11, we obtain the ray-wise feature (the same response for all grids lying on the same ray, as \hat{F} in Figure 7).
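
A sketch of that expectation in PyTorch (the normalization and names are my assumptions, not the repository's code):

```python
import torch

def ray_feature_expectation(cond_prob, ray_feats):
    """Feature expectation along a ray, in the spirit of Eq. 11 (sketch).

    cond_prob: (num_rays, num_samples) conditional probabilities p_hat_k.
    ray_feats: (num_rays, num_samples, C) BEV features gathered at the k
               sample points of each ray.
    Returns one C-dim feature per ray; broadcasting it back to every grid
    on the ray gives the shared ray-wise response described above.
    """
    # Normalize the weights along the ray (assumed; guards empty rays).
    w = cond_prob / (cond_prob.sum(dim=-1, keepdim=True) + 1e-6)
    return (w.unsqueeze(-1) * ray_feats).sum(dim=1)
```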

For Q3, both are 256 points. We sample 256 points along each ray at an interval of 1, and ignore points that fall outside the BEV feature map (for k) or that are farther from the origin than the i-th BEV grid (for j).
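
For concreteness, fixed-interval sampling with an out-of-map mask could look like this sketch (the BEV extent and the assumption that rays start at the map center are mine):

```python
import torch

def sample_ray_points(ray_dirs, num_samples=256, step=1.0, bev_half=128.0):
    """Sample points along BEV rays at a fixed interval (illustrative).

    ray_dirs: (num_rays, 2) unit direction of each ray from the BEV origin.
    Returns the sampled 2D coordinates and a validity mask for points that
    fall inside the BEV feature map.
    """
    t = torch.arange(1, num_samples + 1, dtype=ray_dirs.dtype) * step  # (S,)
    pts = ray_dirs[:, None, :] * t[None, :, None]                      # (R, S, 2)
    valid = (pts.abs() < bev_half).all(dim=-1)                         # (R, S)
    # For the j points one would additionally mask samples farther from
    # the origin than the i-th grid of interest.
    return pts, valid
```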

Best, Zetong

SepJourney commented 10 months ago

Thanks for your reply! It helps me understand the process. It seems the feature-expectation implementation works like an extended ray-wise self-attention layer.