cvg / nice-slam

[CVPR'22] NICE-SLAM: Neural Implicit Scalable Encoding for SLAM
https://pengsongyou.github.io/nice-slam
Apache License 2.0

Quick question about training time #17

Closed: greeneggsandyaml closed this issue 2 years ago

greeneggsandyaml commented 2 years ago

Hello authors, thank you for your great work. I have a very quick question: how long does it take to train a model (for example, on the Apartment scene)?

I am slightly confused because I thought the method was real-time (e.g. on an RTX 3090), so I was expecting the model to run extremely quickly. However, the code takes many hours to run. Is this expected?

For context, I am using an RTX 8000 (Python 3.8, CUDA 11.1, PyTorch 1.10.0), which is slightly slower than an RTX 3090 but still quite fast. I am running the following command:

python -W ignore run.py configs/Apartment/apartment.yaml

I think I must be misunderstanding something about the method, or else something is wrong with my setup.

Thanks for your assistance.

Zzh2000 commented 2 years ago

Hi, thanks for your interest in our work! We did not claim to be real-time; we are also looking forward to iMAP's code and seeing how they achieve real-time performance. You can refer to #9 for more information. By the way, I heard (not necessarily true) that someone has made our method real-time, but they have not released their code yet, so you could also wait a bit for that.

greeneggsandyaml commented 2 years ago

Thanks for the quick response!

To confirm, how many minutes/hours does it take to train a model for the Apartment scene with a single GPU (e.g. an RTX 3090)?

Also, I was under the impression that you were claiming to be real-time from the sentence below. I am not trying to make accusations or anything -- for my use-case, I honestly don't care whether the method is real-time or not. I am just genuinely confused about how long it takes to run, and I want to make sure that I'm not doing something totally wrong.


[screenshot: the sentence from the paper that suggests real-time operation]
greeneggsandyaml commented 2 years ago

After looking at Table 4, I'm afraid I'm even more confused. Doesn't this suggest that tracking should only take a matter of milliseconds?


[screenshot: Table 4 runtime figures from the paper]
Zzh2000 commented 2 years ago

Real-time capable does not mean real-time; it means that with more engineering effort (e.g. once iMAP releases their code, or like the team that has already made NICE-SLAM real-time) it can become real-time. For Table 4, the timings use the same number of pixels and samples per ray as the corresponding table in iMAP; you can refer to #9 for more details. In addition, please note that we already tried our best (6 months) to re-implement iMAP, and iMAP* and NICE-SLAM share most of their code. Our contribution is the scene representation and ensuring local updates under constant computation, while iMAP emphasizes a real-time SLAM system with good engineering. Thanks for your understanding! It is not easy to follow up on a work that has not been open-sourced in a brand-new area.
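As a rough back-of-envelope illustration (the per-iteration times and iteration counts below are made-up placeholders, not numbers from our paper), millisecond-scale per-iteration timings are fully consistent with an end-to-end runtime of hours, since each frame runs many tracking and mapping iterations:

```python
# Hypothetical numbers for illustration only; the real values depend
# on the scene, the config, and the GPU.
ms_per_tracking_iter = 30      # one tracking iteration (ms)
ms_per_mapping_iter = 150      # one mapping iteration (ms)
tracking_iters_per_frame = 10  # tracking iterations run per frame
mapping_iters_per_frame = 60   # mapping optimization steps per frame
n_frames = 2000                # e.g. a Replica-length sequence

total_ms = n_frames * (
    tracking_iters_per_frame * ms_per_tracking_iter
    + mapping_iters_per_frame * ms_per_mapping_iter
)
print(f"total runtime: {total_ms / 3.6e6:.1f} hours")  # ~5.2 hours
```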

greeneggsandyaml commented 2 years ago

Thanks again for the quick response. I really appreciate your openness -- it all makes sense now.

I also understand how much hard work it takes to implement something like this from scratch. Thanks for all your work!

Once I finish training and testing a model, I'm going to leave a comment here saying how long it took (for anyone who looks at this issue in the future).

I'll create a new issue if I have any more questions, and thanks again!

endlesswho commented 2 years ago

@greeneggsandyaml I don't know whether you finished your experiment. I ran the TUM RGB-D datasets, and it cost a lot of time: I spent about three days training fr3_xyz.

greeneggsandyaml commented 2 years ago

Hi, unfortunately I had to run something else on my GPU before the experiment finished. I'm going to run it again when my GPU resources free up. I agree -- it takes a substantial amount of time.

I'll update this when I finish the experiment!

likojack commented 1 year ago

Hey @greeneggsandyaml and @endlesswho, I was also surprised by how long the system takes to run, which is not the impression I got from reading the paper.

Feel free to try out our concurrent work on reconstruction (https://github.com/likojack/bnv_fusion). It runs at near real-time speed, around 2-5 fps on a desktop GPU (tested on a 1080 Ti).

Kind regards, Kejie

Zzh2000 commented 1 year ago

NICE-SLAM is able to do both mapping and tracking, while BNV-Fusion needs ground-truth camera poses as input. We adapt NeRF-style volume rendering to render depth and color images and minimize the difference against the observed depth and color. Although the volume rendering process takes a large amount of time, this optimization-based scene encoding is more general and supports end-to-end tracking. Furthermore, some of the computation is spent encoding color information, which enables novel-view synthesis and provides additional signal for more robust tracking, neither of which is even a concern for BNV-Fusion. It is completely unfair to compare only on the mapping part.
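To illustrate what that rendering-based optimization involves, here is a minimal PyTorch sketch (not our actual code; the occupancy-based weighting, tensor shapes, and loss form are simplified assumptions) of rendering depth and color along one ray and forming a tracking loss against the observations:

```python
import torch

def render_ray(z_vals, occupancy, colors):
    """Render depth and color for one ray by compositing per-sample
    occupancies into termination weights (volume-rendering style).

    z_vals:    (N,)   depths of the N samples along the ray
    occupancy: (N,)   predicted occupancy in [0, 1] at each sample
    colors:    (N, 3) predicted RGB at each sample
    """
    # Probability that the ray is still unoccluded at each sample.
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - occupancy[:-1] + 1e-10]), dim=0)
    weights = occupancy * transmittance             # (N,)
    depth = (weights * z_vals).sum()                # rendered depth
    color = (weights[:, None] * colors).sum(dim=0)  # rendered RGB
    return depth, color

# Tracking sketch: cast rays from the current pose estimate, render,
# and minimize the geometric + photometric error; gradients flow back
# through the sampled points to the pose parameters.
#   depth_pred, color_pred = render_ray(z_vals, occ, rgb)
#   loss = (depth_pred - depth_obs).abs() + (color_pred - color_obs).abs().sum()
#   loss.backward()
```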

Besides, our work focuses more on the overall technical contribution rather than engineering speedups. To date, we have inspired dozens of follow-up works that build upon NICE-SLAM and achieve better results or faster tracking/mapping.

Best, Zihan, Songyou

likojack commented 1 year ago

Hi Zihan and Songyou,

Thanks for the prompt reply. I didn't question the technical contributions of NICE-SLAM at all. I was only surprised at the actual runtime, because the paper does not mention that it can take hours (it took me 4-5 hours on an RTX 3090) to run a Replica sequence of 2000 frames. Furthermore, the common presumption that SLAM systems run in real time does not help in this context.

I mentioned BNV-Fusion because it is a mapping framework also based on a neural implicit representation that runs at almost real time, and it is also open source 😊. People in this thread could benefit from it if they only care about 3D reconstruction and have constraints on runtime. Camera poses are easily obtained from an external real-time SLAM system like ORB-SLAM.

Kind regards, Kejie