HKUST-3DV / DIM-SLAM

This is the official repo for the ICLR 2023 paper "DENSE RGB SLAM WITH NEURAL IMPLICIT MAPS".

about the system branch #17

Closed llianxu closed 7 months ago

llianxu commented 7 months ago

Hello! Thank you very much for open-sourcing this system implementation. I would like to ask whether this code can reach the pose accuracy reported in the paper. The README only mentions that the mesh is not very good, but after running it myself I found that the pose accuracy is also not good enough, so I am asking here.

poptree commented 7 months ago

Hi, I tested the code with the default config. It should achieve pose accuracy similar to the paper, e.g., 0.8 on office0. Let me know if you cannot reproduce the result, so I can check the config later.

poptree commented 7 months ago

Hi, if you mean the plot of the RMSE during training, that one compares the estimated poses with the GT WITHOUT scale correction. The final result is compared with the GT with scale correction, as we mentioned in the paper. You can use evo to evaluate the final trajectory.

llianxu commented 7 months ago

Thank you for your prompt response. But I think the results of the init phase should be consistent? In the end, I tested init and it failed. Additionally, I noticed that the depth maps of this version are poor. There are also some minor bugs, such as the dimslam class not having a start method.

poptree commented 7 months ago

Hi, oh, thank you for the reminder. I ran some experiments on this code version yesterday, like initialization without setting the first two camera poses to the GT ones, but I forgot to revert that modification (in the init function of dimslam.py). I will push the corrected version later. Again, thank you for the reminder.

poptree commented 7 months ago

Hi, I pushed the correct version now. You can test it and let me know if it is still performing poorly.

llianxu commented 7 months ago

Thank you, but I still have a question. As you mentioned, the poses of the first two images are fixed, so do the final estimated poses still need scale correction? The results on the main branch don't seem to need it.

llianxu commented 7 months ago

> Hi, I pushed the correct version now. You can test it and let me know if it is still performing poorly.

I'll test it out as soon as possible, thank you very much

poptree commented 7 months ago

Hi, in an RGB VO/SLAM system, the scale of the camera poses will slightly change during tracking, and this change is inevitable. Thus, a SLAM system usually performs loop closure or a SIM(3) BA to enforce that both the scene points and the camera poses are consistent over the whole sequence (under the same global scale). If the SIM(3) BA or loop closure is missing, we can only apply a scale correction when evaluating the final poses, using a single global scale that minimizes the discrepancy between the estimated trajectory and the GT one. This is a common and important practice in monocular RGB SLAM and VO. How to keep the scale shift as small as possible is itself a topic in the monocular RGB SLAM/VO area, because it partly reflects the robustness of a monocular RGB SLAM/VO system.

BTW, "up to a scale" and "scale shifting" are two different problems. We align the first two poses with the GT to avoid the "up to a scale" problem. We use scale correction during pose evaluation to minimize the influence of "scale shifting", because there is no "loop closure" or "SIM(3) BA" in NeRF-based SLAM/VO systems yet.

For the initialization, you can also enable scale correction, but the results will look the same since the scale shift is too small to observe.

If the input is an RGB-D sequence, the pose scale will not change thanks to the constraints from the depth maps, which means the results will be very similar with or without scale correction.
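
For reference, here is a minimal numpy sketch of the kind of similarity alignment (one global scale, a rotation, and a translation, i.e., Umeyama alignment) that evo performs before computing the ATE RMSE when alignment and scale correction are enabled. The function names and the assumption that `est` and `gt` are (N, 3) arrays of corresponding camera positions are mine, not part of this repo:

    import numpy as np

    def align_umeyama(est, gt):
        """Estimate scale s, rotation R, translation t so that gt ~= s * R @ est + t.
        est, gt: (N, 3) arrays of corresponding camera positions."""
        mu_est, mu_gt = est.mean(0), gt.mean(0)
        est_c, gt_c = est - mu_est, gt - mu_gt
        cov = gt_c.T @ est_c / est.shape[0]
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1.0                       # guard against reflections
        R = U @ S @ Vt
        var_est = (est_c ** 2).sum() / est.shape[0]
        s = np.trace(np.diag(D) @ S) / var_est   # single global scale correction
        t = mu_gt - s * R @ mu_est
        return s, R, t

    def ate_rmse_with_scale_correction(est, gt):
        s, R, t = align_umeyama(est, gt)
        aligned = (s * (R @ est.T)).T + t        # apply the similarity correction
        return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))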

llianxu commented 7 months ago

Thank you for your detailed answer. I ran the newly uploaded code, but the results were still very poor. I would like to know whether the depth maps looked like this when you tested it. This is the result of the init phase at iteration 715: [attached image]

poptree commented 7 months ago

Hi,

I fixed another bug in sfm.py; the output of the initialization should now match the main branch.

llianxu commented 7 months ago

Hi, but I had already tried adjusting the fix-cam weight back to 50 before, and it still didn't seem to work. I'll test it again.

llianxu commented 7 months ago

It still doesn't seem to work. This is the result of the init phase at iteration 715: [attached image]. I checked and found that, compared to the main branch, the fake depth loss and the learning rate of the grid are different.

poptree commented 7 months ago

Hi, here is the output at iteration 705 of the current version. Are you sure the code you are running is up to date?

[attached images: keyframe_ape, 0770]

llianxu commented 7 months ago

Thank you so much. I'll test it again; I'm a bit confused.

poptree commented 7 months ago

Hi,

Let me know if it still does not work after fetching the latest update. It might be an environment issue.

llianxu commented 7 months ago

Hi, I have been wondering about this as well, because I am not using the official NICE-SLAM environment. May I ask whether you are using the official NICE-SLAM environment?

poptree commented 7 months ago

Hi,

Not exactly. It could be an issue with the PyTorch version, since PyTorch is the only core package this reimplementation relies on. For reference, you can find the requirements.txt of my dev environment attached. I have tested PyTorch 1.10 through 2.1.0; all of these versions perform well. requirements.txt

llianxu commented 7 months ago

Hi, I have re-copied the code and used the NICE-SLAM environment, and now the results look correct. I'll test my previous environment again; it's a bit strange.

poptree commented 7 months ago

Hi, Sure. Let me know if you have any other questions.

llianxu commented 7 months ago

OK! Thank you for patiently answering my questions for so long.

llianxu commented 7 months ago

Hi @poptree, I have found the problem: it was the difference between mathutils and scipy in the quaternion convention. And I have a new question. The paper mentions that global keyframes remain unchanged during the tracking phase, but the current implementation changes them. Will this have an impact on the final result?
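
For reference, mathutils stores quaternions in (w, x, y, z) order, while scipy's `Rotation.from_quat`/`as_quat` use (x, y, z, w) by default, so a reordering is needed when converting between the two. A minimal sketch (the helper name is my own):

    import numpy as np
    from scipy.spatial.transform import Rotation

    def wxyz_to_scipy(q_wxyz):
        """Reorder a mathutils-style (w, x, y, z) quaternion to scipy's (x, y, z, w)."""
        w, x, y, z = q_wxyz
        return Rotation.from_quat([x, y, z, w])

    # Example: the identity rotation expressed in the mathutils ordering.
    R = wxyz_to_scipy(np.array([1.0, 0.0, 0.0, 0.0]))
    print(np.allclose(R.as_matrix(), np.eye(3)))  # True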

poptree commented 7 months ago

It will. You should fix the global frames in the tracker the same way as in the mapper: fix_num = len(globalframe).

The two-thread version of this implementation has camera pose performance similar to the one-thread version. In my original implementation, the two-thread version was slightly inferior to the one-thread one.

llianxu commented 7 months ago

OK! I see, thank you very much!

llianxu commented 7 months ago

Hi @poptree, I'm sorry to trouble you again. I want to propose my own contribution on top of DIM-SLAM, which is a pure RGB SLAM, and I would like to add a keyframe selection strategy. My understanding is that this strategy should be based on the optical-flow criterion mentioned in the paper; currently, the code simply selects one keyframe every five frames. I have drafted the code below; is the logic roughly right? Since running the full sequence is too slow, I am asking in advance.

    # Update keyframes based on the optical flow between the last global keyframe
    # and the first frame of the local window.
    import cv2
    import numpy as np

    last_global_keyframe = self.global_keyframe_dict[-1]
    last_global_keyframe_img = last_global_keyframe['color'].cpu().numpy()
    last_global_keyframe_img = (last_global_keyframe_img * 255).astype(np.uint8)
    last_global_keyframe_img = cv2.cvtColor(last_global_keyframe_img, cv2.COLOR_RGB2GRAY)

    first_local_window_img = self.local_window_dict[0]['color'].cpu().numpy()
    first_local_window_img = (first_local_window_img * 255).astype(np.uint8)
    first_local_window_img = cv2.cvtColor(first_local_window_img, cv2.COLOR_RGB2GRAY)

    # Detect FAST corners in the last global keyframe.
    detector = cv2.FastFeatureDetector_create()
    keypoints = detector.detect(last_global_keyframe_img, None)
    last_global_keyframe_img_kp = np.array(
        [kp.pt for kp in keypoints], dtype=np.float32
    ).reshape(-1, 2)

    # Track the corners into the first local-window frame with pyramidal LK flow.
    lk_params = dict(winSize=(15, 15), maxLevel=2,
                     criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))
    first_local_window_kp, status, _ = cv2.calcOpticalFlowPyrLK(
        last_global_keyframe_img, first_local_window_img,
        last_global_keyframe_img_kp, None, **lk_params)

    # Keep only the successfully tracked points and measure the mean squared flow.
    status = status.astype(bool).squeeze()
    last_global_keyframe_img_kp = last_global_keyframe_img_kp[status]
    first_local_window_kp = first_local_window_kp[status]
    flow = first_local_window_kp - last_global_keyframe_img_kp
    flow_MSE = np.mean(np.sum(flow ** 2, axis=1))

    # Promote the frame to a global keyframe if the flow is large enough,
    # otherwise keep it in the remainder set.
    if flow_MSE > 10:
        self.global_keyframe_list.append(frame_idx)
        self.global_keyframe_dict.append(self.local_window_dict[0])
    else:
        self.remainder_list.append(frame_idx)
        self.remainder_dict.append(self.local_window_dict[0])
    self.local_window_list.pop(0)
    self.local_window_dict.pop(0)

Also, the current speed is indeed too slow. I would like to know which parts you parallelized with CUDA to achieve the speedup, as I would like to implement that myself.

Thank you very much. I wish you all the best

poptree commented 7 months ago

Hi,

A more straightforward alternative is to reuse the keyframe selection part of NICE-SLAM.

Knowing the camera pose and the depth map, you can already compute the scene flow. You will get the optical flow by projecting the depth map to another frame.

An even simpler answer: sample some points, e.g., 100, from the last frame, render their depth, and then project them into the previous keyframe. Compute the geometric error (not the reprojection error) between each pixel and its projected counterpart. This distance is also called optical flow in this case.
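
A rough PyTorch sketch of that check (the function name, the pinhole model, and the threshold below are my assumptions, not code from this repo):

    import torch

    def projected_flow(uv, depth, K, T_xy):
        """Flow magnitude of pixels uv (N, 2) from frame x into frame y,
        given their rendered depth (N,) and the relative pose T_xy (4, 4)."""
        ones = torch.ones_like(depth)
        pix_h = torch.stack([uv[:, 0], uv[:, 1], ones], dim=-1)     # (N, 3) homogeneous pixels
        pts_x = (torch.linalg.inv(K) @ pix_h.T).T * depth[:, None]  # back-project into frame x
        pts_y = (T_xy[:3, :3] @ pts_x.T).T + T_xy[:3, 3]            # move the points into frame y
        proj = (K @ pts_y.T).T
        uv_y = proj[:, :2] / proj[:, 2:3]                           # re-project to pixel coordinates
        return torch.linalg.norm(uv_y - uv, dim=-1)

    # Keyframe decision: promote the last frame if the mean flow exceeds a threshold.
    # flow = projected_flow(sampled_uv, rendered_depth, K, T_last_frame_to_prev_keyframe)
    # is_keyframe = flow.mean() > flow_threshold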

For the CUDA part, I wrote most of the optimization in "sfm" to test the time complexity. The most critical parts are the sampling and the warping loss; these take most of the time in Python. For sampling and inference, you can refer to nerfacc, while I implemented the CUDA warping myself.

llianxu commented 7 months ago

OK, I see. So my understanding is that your geometric loss is the distance between corresponding points in 3D space, and the projection loss is the pixel distance, right?

poptree commented 7 months ago

Hi,

The geometric error is P(I') - P(I), where P is the location of a pixel on the image. I is the original pixel on image x; you can project it to another image y with the rendered depth d and the relative camera pose T_{xy} to obtain the projected pixel I'.
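
In symbols (my own shorthand: $\pi$ is the pinhole projection with the camera intrinsics, $\pi^{-1}$ back-projects a pixel using its rendered depth):

$$
I' = \pi\!\left(T_{xy}\,\pi^{-1}(I, d)\right), \qquad e_{\mathrm{geo}} = \lVert P(I') - P(I) \rVert_2
$$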

llianxu commented 7 months ago

OK, I see. Thank you very much!