TRI-ML / sdflabel

Official PyTorch implementation of CVPR 2020 oral "Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors"
MIT License

Question related to Ablation study & CSS Net five layers freeze #7

Open taeyeopl opened 3 years ago

taeyeopl commented 3 years ago

Thanks for sharing this great work! I have two quick questions about the ablation study and the CSS Net freezing.

Q1. Can you explain the difference between the (R,t) / (R,t),s / (R,t),s,z settings in Table 3 of the main paper?

(screenshot of Table 3 from the main paper)

It is hard for me to understand the difference clearly, including at the implementation level. Are these used for generating labels, or are they variables in DeepSDF training? I ask because I can't find where all of (R,t), s, and z take effect in your code. https://github.com/TRI-ML/sdflabel/blob/416c27dbcc6a341c3038783beb46d3f8eccb8177/utils/refinement.py#L501

Q2. In the code, conv1, bn1, and layer1 are frozen. Can you explain how the count of five layers is obtained? The supplementary material (C.1. CSS Net) says that "the first five layers are frozen in order to prevent overfitting to peculiarities of the rendered data".

https://github.com/TRI-ML/sdflabel/blob/416c27dbcc6a341c3038783beb46d3f8eccb8177/networks/resnet_css.py#L156

xmyqsh commented 3 years ago

A1: R (rotation), t (translation), s (scale), z (shape latent code, 3-dimensional in this paper). R, t, and s can be estimated by 3D-3D correspondence estimation. One set of 3D points is the LiDAR frustum points back-projected via NOCS; the other is the points of the rendered DeepSDF model (which is normalized and centered, much like a sampling of a CAD model). Because of the one-to-one correspondence between the NOCS 2D map and the DeepSDF model points, we can sample correspondence pairs and then solve the 3D-3D alignment in closed form with the Kabsch (orthogonal Procrustes) algorithm.
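The 3D-3D alignment step above can be sketched with the closed-form Umeyama extension of Kabsch, which also recovers the scale. This is a generic sketch, not the repo's actual implementation; `kabsch_with_scale` is a hypothetical helper name.

```python
import numpy as np

def kabsch_with_scale(P, Q):
    """Estimate a similarity transform (s, R, t) such that Q ≈ s * R @ P + t.

    P, Q: (N, 3) arrays of corresponding 3D points, e.g. sampled DeepSDF
    surface points (P) and back-projected LiDAR frustum points (Q).
    Closed-form Umeyama solution; hypothetical helper, not from the repo.
    """
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    H = Pc.T @ Qc / len(P)                   # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # correct a possible reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                       # proper rotation, det(R) = +1
    s = np.trace(np.diag(S) @ D) / Pc.var(axis=0).sum()
    t = mu_q - s * R @ mu_p
    return s, R, t
```

Applying a known similarity transform to random points and running the solver recovers (s, R, t) up to numerical precision, which is a quick way to sanity-check the correspondence pairs being fed in.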

z is the conditioning latent vector of DeepSDF, which can be computed for each SDF shape model by MAP estimation in DeepSDF's auto-decoding. That MAP process is expensive, so the resulting z is saved as a CSS label; z can then be predicted by css_net and used as the conditioning input to DeepSDF.
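The MAP auto-decoding step can be sketched as follows, per the DeepSDF paper: freeze the decoder and optimize only z under an L1 SDF loss plus a Gaussian prior on z. All names here (`map_latent`, `decoder`) are hypothetical; this is not the repo's code.

```python
import torch

def map_latent(decoder, points, sdf_gt, latent_dim=3, iters=300, sigma=1e-1):
    """MAP estimate of a DeepSDF latent code for one shape (sketch).

    decoder: a trained (frozen) DeepSDF network mapping [z, x] -> sdf.
    points: (N, 3) sample locations; sdf_gt: (N, 1) ground-truth SDF values.
    The (||z||^2 / sigma^2) term is the Gaussian prior from the DeepSDF paper.
    """
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=1e-3)
    for _ in range(iters):
        opt.zero_grad()
        # Broadcast the single latent code over all sample points
        z_rep = z.unsqueeze(0).expand(points.shape[0], -1)
        pred = decoder(torch.cat([z_rep, points], dim=1))
        loss = torch.nn.functional.l1_loss(pred, sdf_gt) + (z ** 2).sum() / sigma ** 2
        loss.backward()
        opt.step()
    return z.detach()
```

Because each shape requires its own optimization loop, caching the resulting z as a CSS label (and regressing it with css_net at test time) avoids repeating this cost.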

`def get_kitti_label(dsdf, grid, latent, scale, trans, yaw, p_WC, bbox)`: the `latent` argument is generated by the MAP procedure described above, but the related code is not in this repo; it should be in one of the authors' other repos.

A2: "the first five layers" means the first five conv layers: 1 (`self.conv1`) + 4 (`self.layer1`: in ResNet-18, layer1 consists of 2 BasicBlocks with 2 convs each) = 5.

There is a bug in the freeze code; see PR https://github.com/TRI-ML/sdflabel/pull/8 for details.

taeyeopl commented 3 years ago

Thanks for the explanation! I understand each component now, but I still have some confusion.

Q1. Can you explain each experimental setting precisely? It would really help me understand the ablation. As I understand it, based on Equation (10):

  1. [setting 1] (R,t): the scale s is not applied; only R and t are used to transform the DeepSDF rendering points into the LiDAR coordinate frame.
  2. [setting 2] (R,t,s): the same as Equation (10).
  3. [setting 3] (R,t,s,z): the z part is hard for me to understand, because it seems to be a must-have for the optimization. Can you explain the difference with and without z?

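One way to read the three interpretations above is as different sets of variables receiving gradients during refinement. The sketch below is purely illustrative (the repo only implements setting 3, and every name here is hypothetical):

```python
import torch

def refinement_params(setting, rot, trans, scale, latent):
    """Select the optimization variables for each ablation setting (sketch).

    setting 1: (R, t); setting 2: (R, t, s); setting 3: (R, t, s, z).
    All names are hypothetical; the repo only implements setting 3.
    """
    params = [rot, trans]
    if setting >= 2:
        params.append(scale)
    if setting >= 3:
        params.append(latent)
    # Variables left out of the selected set are held fixed
    for p in (rot, trans, scale, latent):
        p.requires_grad_(any(p is q for q in params))
    return torch.optim.Adam(params, lr=1e-2)
```

Under this reading, setting 1 would hold s (and z) at their initial values, which is exactly why the initialization of s matters, as discussed below.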
xmyqsh commented 3 years ago

Setting 3 is the default setting of this repo; settings 1 and 2 are not currently supported. The purpose of the conditioning latent code z is that a single DeepSDF can represent all model shapes, instead of one DeepSDF per shape. I can't work out how settings 1 and 2 were run without more detail from the paper or code, so we need the author to explain them. @zakharos

taeyeopl commented 3 years ago

I don't think it is desirable to compare, as an ablation, a single z covering all model shapes (single class, car) against one DeepSDF per shape, because the original DeepSDF already handles all models of a single class (car). The comparison would make more sense if a single model covered multiple classes (car, bike, etc.). Moreover, a driving scenario can't adopt one DeepSDF per shape; it would be impractical to build models for every car.

Nevertheless, I would appreciate it if you could explain each setting in order to have a clear understanding of the ablation study. @zakharos

zakharos commented 2 years ago

Hi @taeyeop-lee! I apologize for the delay! Please find the answers to your questions below:

Ablation setup The goal of the ablation is to demonstrate how different components of the pipeline affect the final downstream performance (detection). In particular, the 3 settings you are referring to demonstrate how different optimization variables - R (rotation), t (translation), s (scale), and z (latent shape code) - affect the end performance.

From the results in Table 3, we see that setting 3 results in the best overall performance.

Frozen layers @xmyqsh is absolutely right in describing how frozen layers are computed.

xmyqsh commented 2 years ago

@zakharos Actually, my doubt concerns [setting 1]: what is the initial value of the scale s, and what is its range, [0, inf)? I don't think such good results can be achieved without optimizing it. @taeyeop-lee Good suggestion; a further experiment is needed to verify it.