PRBonn / 4DMOS

Receding Moving Object Segmentation in 3D LiDAR Data Using Sparse 4D Convolutions (RAL 2022)
https://www.ipb.uni-bonn.de/wp-content/papercite-data/pdf/mersch2022ral.pdf
MIT License

Performance on trained weights #30

Closed haider8645 closed 1 year ago

haider8645 commented 1 year ago

Hello there,

I am playing back a rosbag file for KITTI sequence 14 at a rate of 0.1. It turns out that, for me, the best-performing weights are 5_scans_no_poses. I transform all 5 scans to the world frame before feeding them to the model. Could you give me a hint as to what I could debug?
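Roughly, the alignment I do per scan is the following (just a sketch with made-up names, given each scan's 4x4 pose in the world frame):

    import numpy as np

    def to_world(points_xyz: np.ndarray, pose_w_lidar: np.ndarray) -> np.ndarray:
        """Transform an [N, 3] scan into the world frame using a 4x4 pose."""
        return points_xyz @ pose_w_lidar[:3, :3].T + pose_w_lidar[:3, 3]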

Please see the pictures below of the 5 scans visualized together in the world frame. We can see the trailing dynamic points clearly. The trail is clearly detected with 5_scans_no_poses.ckpt but not with 5_scans.ckpt. I get similar results for the 2- and 10-scan window weights.

5_scans_no_poses.ckpt image

5_scans.ckpt image

Best regards, Haider

haider8645 commented 1 year ago

I notice that in your work you align the clouds locally, not globally. I am not sure if this could be the reason for the degraded performance. I will align locally and check :)

haider8645 commented 1 year ago

No improvement even if I transform past scans to the current frame

benemer commented 1 year ago

Hey!

The distance to the common coordinate frame should not matter, so it can either be a local one (for example the sensor origin at any point in time) or a global one.

Therefore, I am surprised by your results; they should actually be the other way around. How exactly do you feed the scans from the bag file to 4DMOS?

Best, Benedikt

haider8645 commented 1 year ago

Hello!

I publish the tf and point cloud topics from the bag using rosbag2. 4DMOS is wrapped in a Python ROS 2 node (Humble) that subscribes to the PointCloud2 topic. I fill the sequence of samples, add a time dimension to each point, save all 4D samples as a list of tensors, and feed that to the model. Afterwards, I read the activations of the softmax layer to get the confidences.

I noticed that you use a very high threshold for a point to be marked as dynamic: confidence > 0.5. For me, however, the confidences are really small somehow. Could it be that the softmax is normalizing the scores to 1 over the ~100,000 points, so each point gets a small score? But I guess each point has its own softmax, so the confidence should be high. I use the raw softmax activations and do not aggregate over the previous confidences like you do with the Bayes filter. If I use a very low threshold, e.g. confidence > 0.05, I get good results, but I am not sure why my confidences are so low. Previously, I was using 0.5 as the threshold, so very few dynamic points were detected.

Confidence scores of the softmax layer (last column, which stores the scores for the dynamic class): [0.00567781 0.0041373 0.00374707 ... 0.00003613 0.00003613 0.00010932] Do you have any idea why? For now I use a confidence > 0.05 threshold for the dynamic points.

Some results using a 0.1 s scan time difference and 10_scans.ckpt:

Red are dynamic points image

Another example of good results with low confidence scores:

Before removal image After Removal image

benemer commented 1 year ago

I fill the sequence of samples, add a time dimension to each point, save all 4D samples as a list of tensors, and feed that to the model.

Could you please share a code snippet of this part? Just to make sure that the sparse tensor is composed correctly for 4DMOS.

Could it be that the softmax is normalizing the scores to 1 over the ~100,000 points, so each point gets a small score?

This could be the reason, but I need to see the code to comment on this. As seen here, we apply the softmax for each point individually.
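To make that concrete: the softmax is taken over the class dimension of the [N, num_classes] logits, so every point is normalized independently of how many points there are. A tiny standalone example (not the 4DMOS code itself, class names just illustrative):

    import torch
    import torch.nn.functional as F

    # Toy logits for 100,000 points and 3 classes (e.g. unlabeled, static, moving)
    logits = torch.randn(100_000, 3)

    # dim=1 normalizes each row (each point) on its own, so the number of points
    # does not shrink the per-point confidences.
    probs = F.softmax(logits, dim=1)
    print(probs.sum(dim=1)[:5])  # every row sums to 1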

I use the raw softmax activations and do not aggregate over the previous confidences like you do with the Bayes filter.

This should be fine. The Bayes filter increases the robustness, but it should also give good results without.

haider8645 commented 1 year ago

The model params are:

    self.cfg["TRAIN"]["BATCH_SIZE"] = 1
    self.cfg["MODEL"]["DELTA_T_PREDICTION"] = 0.1
    self.cfg["MODEL"]["N_PAST_STEPS"] = 10
    self.cfg["DATA"]["VOXEL_SIZE"] = 0.1

So when I get a new point cloud, I iterate over it and append the time to each point. For the case of 10 scans, the self.time parameter starts at -0.9. I just add 0.1 to the previous time when a new message arrives, and so on. After my window of 10 scans is complete, I pass these to the model.

So my window is always -0.9, -0.8, ..., 0.0. After the window is full and a new scan arrives, I remove the sample with time -0.9, shift all remaining samples to -0.9, -0.8, ..., -0.1, and then add the new sample at 0.0. This is probably not needed; I could just have the next window run from -0.8 to 0.1, where 0.1 is the newest sample.
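As a rough sketch of this re-stamping (not my actual node, just the idea with made-up names):

    from collections import deque
    import numpy as np

    window = deque(maxlen=10)  # one [N_i, 3] xyz array per scan, oldest first

    def restamp(window, dt=0.1):
        """Attach relative times so the newest scan sits at t = 0.0."""
        stamped = []
        for i, scan in enumerate(window):
            t = -dt * (len(window) - 1 - i)  # -0.9, -0.8, ..., 0.0 for a full window
            stamped.append(np.hstack([scan, np.full((scan.shape[0], 1), t)]))
        return stamped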

# Repeat every time a new msg arrives until the window is filled
def align_messages(self):
    cloud_points = []
    for p in pc2.read_points(self.messages[-1], field_names=("x", "y", "z"), skip_nans=True):
        # append the relative timestamp as the 4th coordinate
        p_t = np.append([p[0], p[1], p[2]], self.time)
        cloud_points.append(p_t)

    # Fill the window of points
    self.points.append(torch.from_numpy(np.asarray(cloud_points)))

After the window is filled, I convert the cloud samples in self.points to CUDA tensors.

cuda_tensor_list = [tensor.to(torch.float32).to('cuda') for tensor in self.points]

Prediction:

   with torch.no_grad():
       out = self.model.forward(cuda_tensor_list)              

I notice that with BATCH_SIZE=1, features_at() returns features for the samples in the window in the same order as they are in the cuda_tensor_list. Since in my window of size 10 the 9th index holds the current sample, I can get those features directly like this.

       logits = out.features_at(9)
       coords = out.coordinates_at(9)

I know you suggested using a mask and matching the sample time, which would be mask = coords[:, -1].isclose(torch.tensor(0)), but this does not give good results for me.

       # mask = coords[:, -1].isclose(torch.tensor(t))
       #masked_logits = logits[mask]
       logits[:, self.ignore_index] = -float("inf")
       pred_softmax = F.softmax(logits, dim=1)
       pred_softmax = pred_softmax.detach().cpu().numpy()
       moving_confidence = pred_softmax[:, 2]
benemer commented 1 year ago

I found the issue!

   logits = out.features_at(9) 

This is definitely wrong, because features_at returns the features at a batch index. This means that your input is currently not a single batch containing 10 timestamps, but 10 batches with just a single timestamp per scan. With this, 4DMOS inferred the moving objects on each scan individually, without doing the convolutions across time.

I actually explained it wrong here, where I said

In your example, cuda_tensor_list is expected to be a list of 10 tensors of size [N_i,4], where N_i is number of points of a scan i and 4 are the 4D coordinates x,y,z,t.

which should be

In your example, cuda_tensor_list is expected to be a list of batches, with each batch being a single tensor of size [N,4], where N is the number of points in the sequence and 4 are the 4D coordinates x,y,z,t.

Sorry for the confusion!

benemer commented 1 year ago

I just updated my reply in the other discussion to avoid confusion in the future.

After passing cuda_tensor_list as a list containing a single torch tensor (the batch at index 0), you can get the features with logits = out.features_at(0) and mask the prediction for a specific timestamp using the time channel of the coordinates coords = out.coordinates_at(0).
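Roughly, and reusing the names from your snippets above, I mean something like this (just a sketch, the masking with isclose assumes the current scan was stamped with t = 0):

    import torch
    import torch.nn.functional as F

    # One batch = one tensor holding the whole 4D sequence (x, y, z, t per point)
    sequence = torch.cat([torch.as_tensor(p, dtype=torch.float32) for p in self.points], dim=0).to('cuda')

    with torch.no_grad():
        out = self.model.forward([sequence])  # list with a single batch at index 0

    logits = out.features_at(0)    # [N, num_classes] for the whole sequence
    coords = out.coordinates_at(0) # [N, 4], last channel is the time

    # Keep only the predictions of the current scan (stamped with t = 0)
    mask = coords[:, -1].isclose(torch.tensor(0))
    logits = logits[mask]

    logits[:, self.ignore_index] = -float("inf")
    moving_confidence = F.softmax(logits, dim=1)[:, 2].detach().cpu().numpy()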

Let me know if that worked for you and solved the issue with the 5_scans.ckpt model as well as the low scores.

benemer commented 1 year ago

By the way, we just released MapMOS which improves upon 4DMOS. Just in case you need another baseline :)

haider8645 commented 1 year ago

If I concatenate all the points in the sequence into a single tensor and send it to the model as a list containing just one tensor, the confidence scores are high, but I start to get a lot of false positives :(

For the transformation of the points to the world frame, I am using the velo_link-to-world transforms. Could these false positives be due to localization errors? Or did you use some other poses to transform the points to a common frame of reference?

        # self.points is a list containing 10 point cloud samples with a 0.1 s time difference
        total_points = np.concatenate(self.points)
        cuda_tensor_list = []
        input = torch.from_numpy(total_points).to(torch.float32).to('cuda')
        cuda_tensor_list.append(input)

White are static, colored are dynamic with confidence > 0.5 image

benemer commented 1 year ago

Does this happen throughout the whole sequence? Or just in this case? To me this looks like a pose issue. How are you registering the scans?

In the lower image of your post here you can see how the wall is not properly aligned. To 4DMOS this looks like the wall is moving and it will be predicted as moving.

To visualize, you could colorize the input scans with respect to the timestamp and see if it drifts.
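For example, something along these lines (just a sketch using Open3D, with your aligned scans and their relative timestamps as input):

    import numpy as np
    import open3d as o3d

    def colorize_by_time(scans, times):
        """One color per timestamp; pose drift shows up as colored 'ghost' copies of static structures."""
        clouds = []
        t_min, t_max = min(times), max(times)
        for scan, t in zip(scans, times):
            pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(np.asarray(scan)[:, :3]))
            c = (t - t_min) / (t_max - t_min + 1e-9)
            pcd.paint_uniform_color([c, 0.2, 1.0 - c])  # old scans blue-ish, new scans red-ish
            clouds.append(pcd)
        o3d.visualization.draw_geometries(clouds)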

haider8645 commented 1 year ago

Yes, you are right. It was actually an error in the pose estimation. The error happens because the transforms on the /tf topic are delayed compared to the /kitti/velo/pointcloud messages. I am not doing any timestamp-based synchronization at the moment, so the transformation to the world frame is wrong for the scans. This results in the misalignment. I will look into it and update you if that was really the issue.

About 4DMOS and the error in the pose estimation: how much misalignment between two scans does 4DMOS tolerate before it starts to mark static objects as dynamic? Would it help if the confidence threshold were not constant, but scaled with other parameters that affect the alignment of the scans even when the objects are static?

Btw, MapMOS looks great. I tested it out and everything works out of the box. Thanks for that!

benemer commented 1 year ago

It is hard to say how much is tolerated. In theory, if the misalignment is larger than the voxel size used for discretization (10 cm in 4DMOS), it could be detected as motion. If the misalignment is smaller, the points from different scans end up in the same spatial voxel, so no motion can be seen in the sparse 4D tensor.

Besides increasing the threshold for the confidence score to suppress the false positives, you could try to increase the voxel size (for example to 0.2 m or even more). Note that this could reduce the number of moving objects you segment, because the motion will be less visible in the sparse tensor, especially for slowly moving objects.
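With the config keys from your snippet above, that would be for example:

    # Coarser voxels are more forgiving to small pose errors,
    # at the cost of missing slowly moving objects
    self.cfg["DATA"]["VOXEL_SIZE"] = 0.2  # was 0.1 in your snippet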

Btw, MapMOS looks great. I tested it out and everything works out of the box. Thanks for that!

Thank you very much! Happy to hear that :)