Open vevenom opened 2 years ago
Hi,
I think you are using our default image size (2688x1792)? If so, what are your CUDA and PyTorch versions? I am not an expert on this, but I have observed that newer PyTorch versions, e.g. 1.7.1, show higher memory usage in nvidia-smi. I think this may be related to PyTorch's internal memory management.
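To illustrate what I mean by internal memory management, here is a minimal sketch (not code from this repo) showing that nvidia-smi reports the memory reserved by PyTorch's caching allocator plus the CUDA context, which can be much larger than what the tensors actually occupy:

```python
import torch

# Minimal sketch: allocate ~1 GiB of float32 data, then free it.
x = torch.randn(1024, 1024, 256, device="cuda")  # 1024*1024*256*4 bytes = 1 GiB
del x  # the tensor is freed ...

# ... but the caching allocator keeps the block reserved for reuse,
# so nvidia-smi still shows it as used by the process.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.3f} GiB")  # ~0 GiB
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.3f} GiB")   # ~1 GiB
```

How aggressively memory is cached can differ between PyTorch versions, which would explain why newer versions show higher numbers in nvidia-smi.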
Yes, I am using the default image size. The CUDA version is 11.3 and the PyTorch version is 1.10. That is an interesting observation; I will do some analysis to get a deeper understanding and provide an update if I find anything interesting.
Thanks for the quick response.
@vevenom I noticed in the past that, due to the structure of the eval script, the initialization was invoked 3 times instead of just once (which is strange since I do not know where that was coming from). However, after the latest merge #57 it should only happen once, so you should be seeing the advertised 5GB consumption instead of 15GB. At least that's what I saw on my end when running the older and newer versions.
Hi, this is very good work, but could you provide your point cloud results on the DTU evaluation set?
@atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.
Thanks, but we do not have an account on my.sharepoint.com.
@vevenom have you tried again using the code after merging this PR #57? Curious if you still see increased GPU memory consumption, because that's not something I see on my end now. If the issue still persists, please share some more info on the setup and data you're using, otherwise if the issue is resolved feel free to close it.
Sorry, I was busy with other experiments. I am going to check this in the following week and provide an update if the issue persists.
Even with #57 and the latest merge, I still observe large memory usage in nvidia-smi. However, the allocated memory never seems to reach this number, so I tried to track in the code where the large spike occurs for comparison.
Interestingly, for the living_room scene in the ETH3D dataset, for example, there is a spike in reserved memory while the allocated memory does not change as much for this line:
torch.cuda.memory_allocated: from 1.961059GB to 2.266113GB
torch.cuda.memory_reserved: from 7.986328GB to 13.800781GB
torch.cuda.max_memory_reserved: from 18.468750GB to 18.468750GB
and after separating the function calls in this line, it seems the spike is coming from self.res. Still, when printing the max memory allocated, I get torch.cuda.max_memory_allocated: 11.5GB. Earlier, there is a notable spike in reserved memory when extracting image features, while the allocated memory again does not change much, in this line:
torch.cuda.memory_allocated: from 0.876828GB to 1.020382GB
torch.cuda.memory_reserved: from 0.878906GB to 18.466797GB
torch.cuda.max_memory_reserved: from 0.878906GB to 18.466797GB
There are some reports of such behavior for CUDA 11, so I would be interested to see whether other users with a similar setup are affected by this issue as well. Again, the CUDA version I am using is 11.3 and the PyTorch version is 1.10.
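If the gap really comes from the caching allocator rather than from live tensors, one thing that might be worth trying (just an assumption on my side, not something the repo does) is clearing the cache between scenes and, on PyTorch 1.10, capping the allocator's split size via PYTORCH_CUDA_ALLOC_CONF:

```python
import os

# Assumption: capping the block split size makes the allocator return large
# free blocks to the driver more eagerly (available in PyTorch >= 1.10).
# Must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

def process_scene(scene_name):
    # Placeholder for running PatchmatchNet inference on one scene.
    pass

for scene in ["living_room"]:  # hypothetical scene list
    process_scene(scene)
    # Release cached, unused blocks so nvidia-smi reflects actual usage;
    # this does not free memory held by live tensors.
    torch.cuda.empty_cache()
    print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 2**30:.3f} GiB")
```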
The CUDA version I am using is 11.0 and the PyTorch version is 1.7.1.
@hx804722948 Can you please provide more info regarding the dataset and inputs you're using? I think Torch 1.7.1 had an issue with a batch size of 2 and @FangjinhuaWang had to do a workaround to make the train loss work correctly. Running with Torch 1.9.1 there is no issue with the batch size. If what you're seeing is unrelated to the Torch version and/or the batch size, I can try to reproduce your results and see if there's a bug in the code.
@anmatako thank you, I will try with Torch 1.9.1. I use convert_dtu_dataset and train.py with the parameters --batch_size 2 --epochs 8 --input_folder Y:\converted_dtu --train_list lists/dtu/train.txt --test_list lists/dtu/val.txt --output_folder D:/checkpoints/PatchmatchNet/dtu/Cuda110+torch1.7.1 --num_light_idx 7. Training works correctly with CUDA 10.1 + torch 1.3.0, but the loss becomes NaN with torch 1.7.1, although eval works correctly with torch 1.7.1.
@anmatako thank you very much! CUDA 11.1 + torch 1.9.1 works correctly.
@vevenom I will try to repro your issue on my end once I get some time hopefully soon. I have not seen such discrepancy running on Windows, but maybe I'm missing something in the way I'm monitoring the memory usage. It could be that I'm monitoring allocated vs reserved memory, and I'll also keep in mind the potential issue with CUDA 11. I'll let you know once I have more on that one.
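In case it helps with aligning measurements, here is a rough sketch (not code from the repo) of the kind of helper that logs allocated vs reserved memory around a call, similar to what was described above:

```python
from contextlib import contextmanager
import torch

@contextmanager
def cuda_memory_report(tag: str):
    """Print allocated vs reserved CUDA memory before and after a block."""
    torch.cuda.synchronize()
    alloc_before = torch.cuda.memory_allocated()
    reserved_before = torch.cuda.memory_reserved()
    yield
    torch.cuda.synchronize()
    gib = 2 ** 30
    print(f"[{tag}] allocated: {alloc_before / gib:.3f} -> {torch.cuda.memory_allocated() / gib:.3f} GiB")
    print(f"[{tag}] reserved:  {reserved_before / gib:.3f} -> {torch.cuda.memory_reserved() / gib:.3f} GiB")
    print(f"[{tag}] peak allocated: {torch.cuda.max_memory_allocated() / gib:.3f} GiB")

# Hypothetical usage around the suspected spot, e.g. feature extraction:
# with cuda_memory_report("feature extraction"):
#     features = model.feature(images)  # placeholder call, not the repo's API
```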
Hello, when I run the BaseEvalMain_web.m file in MATLAB, is there a GPU-accelerated version of the .m file?
No, we use the original evaluation code from the DTU dataset.
Thank you very much for your reply, and thank you again for your selfless dedication. I sincerely wish you a happy life and that everything goes well.
Hi,
when running PatchmatchNet on the ETH3D dataset through eval.py, I end up using 15GB of GPU memory, while the paper reports 5529 MB. Could it be that all images for all scenes are loaded into memory at the same time through the Dataloader? Or is there something else in the code that might be causing such large memory consumption? I appreciate your answer, thanks.
Best, Sinisa