FangjinhuaWang / PatchmatchNet

Official code of PatchmatchNet (CVPR 2021 Oral)
MIT License
506 stars, 70 forks

GPU memory consumption #58

Open vevenom opened 2 years ago

vevenom commented 2 years ago

Hi,

when running PatchmatchNet on the ETH3D dataset through eval.py, I end up using 15 GB of GPU memory, while the paper reports 5529 MB. Could it be that all images for all scenes are loaded into memory at once through the DataLoader? Or is there something else in the code that might be causing such large memory consumption?

I appreciate your answer, thanks.

Best, Sinisa

FangjinhuaWang commented 2 years ago

Hi,

I think you are using our default image size (2688x1792)? If so, what are your CUDA and PyTorch versions? I am not an expert on this, but I have observed that using a higher PyTorch version, e.g. 1.7.1, results in higher memory usage shown in nvidia-smi. I think this may be related to PyTorch's internal memory management.
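
For context, this kind of gap usually comes from PyTorch's caching allocator: nvidia-smi shows roughly the memory the allocator has reserved from the driver plus the CUDA context, which can be much larger than the memory actually allocated for tensors. Below is a minimal sketch, not code from this repository, that makes the two numbers visible:

```python
import torch

# Not repository code: memory_allocated() counts live tensors, while
# memory_reserved() counts what the caching allocator holds on to; nvidia-smi
# roughly reports the reserved amount plus the CUDA context, and the cache is
# not returned to the driver when tensors are freed.

def print_cuda_memory(tag: str) -> None:
    gib = 1024 ** 3
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / gib:.3f} GB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.3f} GB")

if torch.cuda.is_available():
    print_cuda_memory("start")
    x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
    print_cuda_memory("after allocating x")
    del x
    print_cuda_memory("after deleting x (still reserved by the cache)")
```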

vevenom commented 2 years ago

Yes, I am using the default image size. The CUDA version is 11.3 and the PyTorch version is 1.10. I see, that is an interesting observation. I will do some analysis to get a deeper understanding of this and provide an update if I find anything interesting.

Thanks for the quick response.

anmatako commented 2 years ago

@vevenom I noticed in the past that, due to the structure of the eval script, the initialization was invoked 3 times instead of just once (which is strange, since I do not know where this was coming from). However, after the latest merge #57 it should only happen once, so you should be seeing the advertised 5 GB consumption instead of 15 GB. At least that's what I saw on my end when running the older and newer versions.

atztao commented 2 years ago

Hey, this is very good work, but could you provide your point cloud results on the DTU validation set?

anmatako commented 2 years ago

@atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.

atztao commented 2 years ago

> @atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.

Thanks, but we have no account on my.sharepoint.com.

anmatako commented 2 years ago

@vevenom have you tried again using the code after merging this PR #57? Curious if you still see increased GPU memory consumption, because that's not something I see on my end now. If the issue still persists, please share some more info on the setup and data you're using, otherwise if the issue is resolved feel free to close it.

vevenom commented 2 years ago

Sorry, I was busy with other experiments. I am going to check this in the following week and provide an update if the issue persists.

vevenom commented 2 years ago

Even with #57 and the latest merge, I still observe large memory usage in nvidia-smi. However, the allocated memory never seems to reach that number. I tried to track where in the code the large spike occurs, for comparison.

Interestingly, for the living_room scene in the ETH3D dataset, for example, there is a spike in reserved memory at the following line, while the allocated memory does not change as much:

https://github.com/FangjinhuaWang/PatchmatchNet/blob/82206d8b603ec925b6e4b1990618e0ad769347de/models/net.py#L116

torch.cuda.memory_allocated: from 1.961059GB to 2.266113GB
torch.cuda.memory_reserved: from 7.986328GB to 13.800781GB
torch.cuda.max_memory_reserved: from 18.468750GB to 18.468750GB

After separating the function calls in this line, it seems the spike is coming from self.res. Still, when printing the max memory allocated, I get torch.cuda.max_memory_allocated: 11.5GB. Earlier, there is a notable spike in reserved memory when extracting image features, while the allocated memory again barely changes, at this line:

https://github.com/FangjinhuaWang/PatchmatchNet/blob/82206d8b603ec925b6e4b1990618e0ad769347de/models/net.py#L51


torch.cuda.memory_allocated: from 0.876828GB to 1.020382GB
torch.cuda.memory_reserved: from 0.878906GB to 18.466797GB
torch.cuda.max_memory_reserved: from 0.878906GB to 18.466797GB

There are some reports of such behavior for CUDA 11, so I would be interested to see whether other users with a similar setup are affected by this issue as well. Again, the CUDA version I am using is 11.3 and the PyTorch version is 1.10.
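
For reference, here is a minimal sketch of the kind of before/after logging used to produce the numbers above; the helper name and the wrapped module are illustrative stand-ins, not the PatchmatchNet code at net.py:

```python
import torch
import torch.nn as nn

# Illustrative only: a small helper for logging allocator statistics before and
# after a suspect call, in the style of the numbers reported above. The module
# and tensor shapes are stand-ins, not the actual call at models/net.py#L116.

def log_cuda_stats(tag: str) -> None:
    gib = 1024 ** 3
    print(f"{tag}: "
          f"allocated={torch.cuda.memory_allocated() / gib:.6f} GB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.6f} GB, "
          f"max_reserved={torch.cuda.max_memory_reserved() / gib:.6f} GB")

if torch.cuda.is_available():
    module = nn.Conv2d(16, 16, kernel_size=3, padding=1).cuda()
    features = torch.randn(1, 16, 448, 672, device="cuda")

    log_cuda_stats("before call")
    output = module(features)   # the suspect call would go here
    torch.cuda.synchronize()    # make sure the kernel has actually finished
    log_cuda_stats("after call")
```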

hx804722948 commented 2 years ago

The CUDA version I am using is 11.0 and the PyTorch version is 1.7.1 (see attached screenshot).

anmatako commented 2 years ago

@hx804722948 Can you please provide more info about the dataset and inputs you're using? I think Torch 1.7.1 had an issue with a batch size of 2, and @FangjinhuaWang had to do a workaround to make the training loss work correctly. Running with Torch 1.9.1 there is no issue with the batch size. If what you're seeing is unrelated to the Torch version and/or the batch size, I can try to reproduce your results and see if there's a bug in the code.

hx804722948 commented 2 years ago

@anmatako thank you, I will try with Torch 1.9.1. I use convert_dtu_dataset and train.py with these parameters:

--batch_size 2 --epochs 8 --input_folder Y:\converted_dtu --train_list lists/dtu/train.txt --test_list lists/dtu/val.txt --output_folder D:/checkpoints/PatchmatchNet/dtu/Cuda110+torch1.7.1 --num_light_idx 7

Training works correctly with CUDA 10.1 + torch 1.3.0, but the loss becomes NaN with torch 1.7.1; evaluation works fine with torch 1.7.1.

hx804722948 commented 2 years ago

@anmatako thank you very much! CUDA 11.1 + torch 1.9.1 works correctly.

anmatako commented 2 years ago

@vevenom I will try to repro your issue on my end once I get some time, hopefully soon. I have not seen such a discrepancy running on Windows, but maybe I'm missing something in the way I'm monitoring memory usage. It could be that I'm monitoring allocated vs. reserved memory, and I'll also keep in mind the potential issue with CUDA 11. I'll let you know once I have more on that one.
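
As a side note, two standard PyTorch calls can help distinguish allocator caching from a real leak when monitoring with nvidia-smi; this is a generic sketch, not repository code:

```python
import torch

# Generic sketch, not repository code. empty_cache() returns the allocator's
# cached blocks to the driver, so the number shown by nvidia-smi should drop
# towards the truly allocated amount; reset_peak_memory_stats() clears the
# max_* counters so a single scene or forward pass can be measured in
# isolation. On PyTorch 1.10+ the caching allocator can reportedly also be
# tuned via the PYTORCH_CUDA_ALLOC_CONF environment variable (e.g.
# max_split_size_mb) to reduce fragmentation.

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
```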

ly27253 commented 2 years ago

Hello, when I run the BaseEvalMain_web.m file in MATLAB, is there a GPU-accelerated version of the .m file?

FangjinhuaWang commented 2 years ago

No, we use the original evaluation code from the DTU dataset.

ly27253 commented 2 years ago

> No, we use the original evaluation code from the DTU dataset.

Thank you very much for your reply, and thank you again for your selfless dedication. I sincerely wish you a happy life and all the best.