Hi, it has been a long time since I ran this experiment, so I can only give you some instructions. You can have a try.
self.vertices_z = nn.Parameter(torch.zeros((self.norm_xy.shape[0], 1), device=self.norm_xy.device))
if activated_idx is None:
    vertices_z = self.vertices_z
else:
    # Only the activated vertices should receive gradients.
    activated_vertices_z = self.vertices_z[activated_idx]
    if activated_vertices_z.requires_grad:
        # Zero out NaN gradients before the optimizer step.
        activated_vertices_z.register_hook(clean_nan)
    # clone() so the in-place write below does not touch the parameter's storage.
    vertices_z = self.vertices_z.detach().clone()
    vertices_z[activated_idx] = activated_vertices_z
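For reference, `clean_nan` above is a gradient hook; a minimal version could look like the sketch below (adapt as needed, this is not necessarily the exact implementation):

import torch

def clean_nan(grad):
    # Gradient hook: replace NaN/Inf entries so a few bad pixels
    # cannot poison the elevation update.
    return torch.nan_to_num(grad, nan=0.0, posinf=0.0, neginf=0.0)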
Hi @DRosemei, thanks for your time and code snippets, with which I was able to implement the two aforementioned methods of mesh elevation representation, namely: (1) mesh elevation as per-vertex learnable parameters, and (2) mesh elevation as an MLP.
I previously assumed that while the mesh elevation learned by the first method has more jitter, it can capture higher-frequency details. If this is true, the first method could be very helpful, as it would be able to adjust the mesh elevation more drastically when the initial elevation is quite far from the ground truth.
The results I got suggest otherwise. The first method does have more jitter, but its ability to drastically adjust the initial mesh elevation is even weaker than the second method's. Below are the three experiments I did.
1. Test the second method when mesh elevation initialization is close to ground truth
The purpose of this step is to confirm that the code/configuration is okay. I selected scene 0655 of the nuScenes dataset with a small BEV size (60 m × 60 m) to speed up the experiment, and camera_height is set to its ground-truth value of 1.5 m. I also made some code changes to support switching between the two methods.
Below is the mesh elevation visualization right after z initialization (before training). The figure title shows the min, max, and mean of the mesh elevations, which are 0.011, 0.062, and 0.042. The color bar range of the figure comes from an OrthographicCamera placed 10 meters above the BEV frame, and runs from 9.95 to 9.99.
Below is the mesh elevation after training. We can see the elevation statistics have been adjusted to 0.033/0.077/0.047 (min/max/mean). Also, the right side is slightly closer to the camera, which matches our expectation.
Below is the final BEV RGB; we can see that it is quite good.
2. Test the second method when mesh elevation initialization is 40 cm off
We know that the camera is 1.5 meters above the ground. Here I deliberately set the camera height to 1.9 meters, so the BEV frame would be 1.9 meters below the camera, and thus the road surface would be 0.4 meters above the BEV frame. In other words, an ideal training run would adjust the mesh elevation to around 0.4 meters.
Below is the mesh elevation visualization right after z initialization, which is what we expect.
Below is the mesh elevation after training. The mean mesh elevation was adjusted from 0.042 to 0.051; it is optimized to grow in the right direction, but is still far from the 0.4-meter ground truth.
Below is the final BEV RGB; we can see the crosswalk on the left side of the image is quite blurred.
3. Test the first method when mesh elevation initialization is 40 cm off
Below is the mesh elevation visualization right after z initialization, which is all zeros, as we expected.
Below is the mesh elevation after training. The mean mesh elevation remains 0, which is even worse than the second method.
Below is the final BEV RGB; it is not good at all.
From the above experiments, it looks like both methods are insufficient at learning mesh elevation from an incorrect initialization. This leads me to wonder what can be done to improve this behavior.
@LevinJ Thanks for your interest. For in-the-wild data, elevation initialization is important; you can refer to https://arxiv.org/pdf/2309.11754.pdf for more information. We use ego trajectories and sparse SfM points or dense MVS points to initialize road elevations.
Hi @DRosemei, thanks for the pointer to the excellent CAMA work. Indeed, a careful mesh elevation initialization procedure (by means of ego trajectories, SfM, or MVS) before the actual mesh training can effectively solve this problem.
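For concreteness, here is a minimal sketch of what such an initialization could look like; interpolating sparse road points onto the grid vertices is my own assumption of the procedure, not the authors' actual code:

import numpy as np
from scipy.interpolate import griddata

def init_vertices_z(road_points_xyz, grid_xy):
    """road_points_xyz: (N, 3) sparse SfM/MVS road points; grid_xy: (M, 2) mesh vertices."""
    # Interpolate the sparse elevations onto the BEV grid; vertices outside
    # the convex hull of the points fall back to 0.
    z = griddata(road_points_xyz[:, :2], road_points_xyz[:, 2], grid_xy,
                 method="linear", fill_value=0.0)
    return z.astype(np.float32)[:, None]  # (M, 1), matching the vertices_z shape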
On the other hand, if we could simply initialize the mesh elevation to zero and then somehow empower ROME to learn the elevation based only on 2D RGB and semantic supervision, the whole road surface reconstruction pipeline could be much simplified. I don't have any definite clues on how this can be achieved, or whether it is even possible (though I know NeRF/NeuS can do a decent job of learning 3D depth from 2D images alone, without depth initialization).
On a surface level, the problem is that the output layer of the HeightMLP network in ROME below can only output elevation values within a small range (like -0.2 m to 0.2 m):
self.height_layer_1 = nn.Sequential(
    nn.Linear(self.D + self.pos_channel, self.D), nn.ReLU(),
    nn.Linear(self.D, self.D), nn.ReLU(),
    nn.Linear(self.D, self.D), nn.ReLU(),
    nn.Linear(self.D, 1),  # final layer: unbounded scalar elevation
)
Here, what we wish to achieve is that the HeightMLP network can output elevation values over a larger range (like -10 meters to 10 meters). Looking at the ExtrinsicModel network in ROME, it controls the range of its translation outputs with the mechanism below:
translations = self.configs["extrinsic"]["translation_m"]*torch.tanh(translations.unsqueeze(2))
From what I understand, tanh squashes the output to [-1, 1]; multiplying by the configured maximum value then explicitly bounds the final output to [-max, max].
I am wondering if we can do something similar to the HeightMLP output, for example to make the elevation range [-10, 10]. Do you think this idea is worth trying, or are there other potential solutions we could explore to make ROME learn elevation without careful initialization?
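For concreteness, here is a minimal sketch of the idea. The layer shapes mirror the snippet above, while the class name, the default sizes, and the max_z hyperparameter are my own assumptions:

import torch
import torch.nn as nn

class BoundedHeightMLP(nn.Module):
    def __init__(self, D=256, pos_channel=16, max_z=10.0):
        super().__init__()
        self.max_z = max_z
        self.height_layer_1 = nn.Sequential(
            nn.Linear(D + pos_channel, D), nn.ReLU(),
            nn.Linear(D, D), nn.ReLU(),
            nn.Linear(D, D), nn.ReLU(),
            nn.Linear(D, 1),
        )

    def forward(self, x):
        # tanh bounds the raw output to [-1, 1]; scaling by max_z makes the
        # final elevation range [-max_z, max_z], mirroring ExtrinsicModel.
        return self.max_z * torch.tanh(self.height_layer_1(x))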
Hi @LevinJ. We have tried scaling the HeightMLP output, but it is difficult to restore fine road elevation due to weak RGB & label supervision. In particular, camera sensors have varying exposures, which is different from the official datasets used in NeRF/NeuS. So if you want precise elevation, traditional methods like SfM/MVS will give strong supervision. Below are our reconstruction results by MVS:
On second thought: to recover the ground-truth depth of a pixel, NeRF/NeuS sample all possible depths between the near and far bounds, using numerical quadrature in the volume rendering process. ROME, on the other hand, only tries to recover the ground-truth depth from one fixed initialized depth. Your kind sharing above is very informative about how to obtain good results on real-world data. Thanks, @DRosemei.
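To illustrate the contrast, here is a toy version of the quadrature NeRF uses to marginalize over many candidate depths; the bounds and random densities are illustrative stand-ins, not from either codebase:

import torch

near, far, n_samples = 0.5, 12.0, 64
t = torch.linspace(near, far, n_samples)              # candidate depths
sigma = torch.relu(torch.randn(n_samples))            # stand-in densities
delta = torch.full((n_samples,), (far - near) / n_samples)
alpha = 1.0 - torch.exp(-sigma * delta)               # per-sample opacity
trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
weights = trans * alpha                               # quadrature weights
# Every candidate depth contributes to the estimate; a mesh rasterizer
# instead commits to the single depth given by the current elevation.
expected_depth = (weights * t).sum() / weights.sum().clamp(min=1e-8)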
@LevinJ Hello, I'm curious about your first-method experiment, "3. Test the first method when mesh elevation initialization is 40 cm off".
I understand that the SquareFlatGridRGBLabel class corresponds to the first method, and SquareFlatGridRGBLabelZ is for the second method.
So I used SquareFlatGridRGBLabel to initialize the mesh, and the resulting elevation heatmap aligns with yours. However, after training, I noticed no change in the elevation. Could you please share how you configured it? Or did I misunderstand something?
This is my training config:
# Training Parameters
waypoint_radius: 400 # meters
batch_size: 6
pos_enc: 4
lr:
  vertices_rgb: 0.1
  vertices_label: 0.1
  # vertices_z: 0.001
  rotations: 0.01
  translations: 0.01
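For my own understanding, here is a toy sketch (not RoMe's actual code, all names are my assumptions) of how such an lr dict could map to optimizer param groups:

import torch
import torch.nn as nn

class ToyMesh(nn.Module):
    def __init__(self, n_vertices=100):
        super().__init__()
        self.vertices_rgb = nn.Parameter(torch.zeros(n_vertices, 3))
        self.vertices_label = nn.Parameter(torch.zeros(n_vertices, 3))
        self.vertices_z = nn.Parameter(torch.zeros(n_vertices, 1))

mesh = ToyMesh()
lr = {"vertices_rgb": 0.1, "vertices_label": 0.1, "vertices_z": 0.001}
optimizer = torch.optim.Adam(
    [{"params": [getattr(mesh, name)], "lr": value} for name, value in lr.items()]
)
# Note: a parameter missing from the lr dict gets no param group here
# and would never be updated by the optimizer.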
This picture is the initial mesh. My OrthographicCameras translation is set to np.array([-cx, -cy*2, 1]), so the zmin, zmax, and zmean are relative to 1.
This picture is after training.
Hi, in your nice RoMe paper, you have the statements below:
so we initiate our experiments by exploring two methods for BEV elevation representation. The first method treats BEV elevation as independent optimizable parameters, similar to RGB and semantics. The alternative one utilizes an MLP representation.
While playing with the "alternative" approach (representing elevation as an MLP), it looks like the learned heights can only vary within a very small range. So I wonder if you could kindly shed some light on what configuration/code I need to change to experiment with the first method? Thanks.