Program freezes when training

bennyguo / instant-nsr-pl

Neural Surface reconstruction based on Instant-NGP. Efficient and customizable boilerplate for your research projects. Train NeuS in 10min!

MIT License

Program freezes when training #11

Open ZirongChan opened 2 years ago

ZirongChan commented 2 years ago

Hi, thanks for the great work.

While trying to train on the synthetic drums data with your framework (for the first time), the program froze like this: [screenshot: 20221108163936]

I understand that some scripts are compiled the first time the code is run, but it has been more than 2 hours and it is still frozen. Any advice would be much appreciated. Thanks in advance.

bennyguo commented 2 years ago

Could you please provide the following information:

GPU model & CUDA version
PyTorch version and how you install it (pip or conda)

ZirongChan commented 2 years ago

Of course. Thanks for your quick reply.

My GPU is a GeForce GTX 1060 (a poor one) with CUDA 11.3. The PyTorch version is 1.12.0 (py3.9_cuda11.3_cudnn8_0), which was installed via pip if I remember correctly. @bennyguo

bennyguo commented 2 years ago

Can you try replacing every FullyFusedMLP with VanillaMLP in the config file and see if that works? If it still hangs, press Ctrl+C and check the stack trace to find where the program gets stuck.
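
The config change suggested above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the helper name `swap_mlp_type` and the sample YAML fragment (including its key names) are assumptions made for illustration, so check them against your actual config file before relying on it.

```python
import re


def swap_mlp_type(config_text: str) -> str:
    """Return config_text with every whole-word FullyFusedMLP replaced by VanillaMLP."""
    return re.sub(r"\bFullyFusedMLP\b", "VanillaMLP", config_text)


# Hypothetical config fragment; the key names are illustrative assumptions.
sample = """\
model:
  geometry:
    mlp_network_config:
      otype: FullyFusedMLP
  texture:
    mlp_network_config:
      otype: FullyFusedMLP
"""

patched = swap_mlp_type(sample)
print(patched)
```

To apply it to a real file, read the file's text, pass it through `swap_mlp_type`, and write the result to a copy so the original config stays untouched.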

ZirongChan commented 2 years ago

No, it did not work. I got exactly the same log as the one I posted. Ctrl+C does not work either, which is even weirder. I've also noticed that the code copy operation was not executed before the program got stuck. Maybe I can add some debug printing to the Python script. Where would you suggest I start?

liruilong940607 commented 2 years ago

Maybe this is related: https://github.com/KAIR-BAIR/nerfacc/issues/70#issuecomment-1279782194

bennyguo commented 2 years ago

@liruilong940607 Thanks! @ZirongChan Could you try Ruilong's solution and see if it works? If it still does not, try to manually kill the program this time and check the stack trace.
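
When Ctrl+C produces no stack trace, one way to still see where a Python process is stuck is the standard-library `faulthandler` module. The following is a minimal sketch under stated assumptions: the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` and the 10-minute default timeout are hypothetical, and the call would have to be added by hand near the top of the training entry script. It is not something the repository provides.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s: float = 600.0, file=sys.stderr) -> None:
    """Arrange for Python tracebacks of all threads to be written to `file`."""
    # Dump tracebacks on fatal errors such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=file, all_threads=True)
    # Watchdog: dump tracebacks every timeout_s seconds without exiting.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    # Where available (not on Windows), `kill -USR1 <pid>` dumps on demand.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)


def disable_hang_diagnostics() -> None:
    """Undo everything enable_hang_diagnostics set up."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
    faulthandler.disable()
```

With this enabled, the periodic dump or a `kill -USR1 <pid>` from another terminal should print the Python-level stack of each thread, which may show whether the hang is in extension compilation, data loading, or the training loop itself.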

badarrrr commented 4 months ago

I ran into the same problem. How did you fix it in the end?