Xinjie-Q / GaussianImage

[ECCV 2024] GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting
https://xingtongge.github.io/GaussianImage-page/

Why can't I run this on any CUDA device except cuda:0? #10

Open · ChaosAdmStudent opened 2 months ago

ChaosAdmStudent commented 2 months ago

I modified the codebase slightly so I could run it on a different CUDA device, but I always end up with an "illegal memory access was encountered" error if I use anything other than cuda:0. Any idea why this is happening and how I can fix it?

I believe the error originates in the project_gaussians_2d function. If I use the cuda:1 device and try to print xys (or any other output of this function), I get the CUDA illegal memory access error. With cuda:0, the outputs print just fine.
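For reference, this is roughly what I am running (a minimal sketch; the signature and shapes follow my reading of the repo's gsplat fork, and the values are placeholders):

```python
import torch
from gsplat.project_gaussians_2d import project_gaussians_2d  # import path may differ per install

device = torch.device("cuda:1")  # anything other than cuda:0 triggers the error

num_points, H, W = 10000, 256, 256
BLOCK_H = BLOCK_W = 16
tile_bounds = ((W + BLOCK_W - 1) // BLOCK_W, (H + BLOCK_H - 1) // BLOCK_H, 1)

# Every input tensor is created directly on the target device.
means = torch.rand(num_points, 2, device=device)     # placeholder values
cholesky = torch.rand(num_points, 3, device=device)  # placeholder values

xys, depths, radii, conics, num_tiles_hit = project_gaussians_2d(
    means, cholesky, H, W, tile_bounds
)
print(xys)  # -> "CUDA error: an illegal memory access was encountered" on cuda:1
```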

Xinjie-Q commented 2 months ago

I'm curious whether our original code can run on different CUDA devices on your server. If you do not set CUDA_VISIBLE_DEVICES, you need to revise this line: https://github.com/Xinjie-Q/GaussianImage/blob/f06988cce9ef8a40eed847f1c8b241439eed4624/train.py#L28. In that code, we have hard-coded cuda:0.
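For example, something along these lines (a sketch, not the exact diff; the pick_device helper and the flag it would read are hypothetical):

```python
import os

# Option 1: restrict which GPUs the process sees *before* importing torch, so the
# desired physical GPU is exposed as cuda:0 and the hard-coded device still works.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1")

import torch


def pick_device(device_name: str = "cuda:0") -> torch.device:
    # Option 2: make the device a parameter instead of hard-coding "cuda:0";
    # device_name would come from a command-line flag (hypothetical, e.g. --device).
    return torch.device(device_name if torch.cuda.is_available() else "cpu")


device = pick_device()
print(device)  # cuda:0 here maps to physical GPU 1 because of the env var above
```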

ChaosAdmStudent commented 2 months ago

> I'm curious whether our original code can run on different CUDA devices on your server. If you do not set CUDA_VISIBLE_DEVICES, you need to revise this line: https://github.com/Xinjie-Q/GaussianImage/blob/f06988cce9ef8a40eed847f1c8b241439eed4624/train.py#L28. In that code, we have hard-coded cuda:0.

In my codebase, I am calling the project_gaussians_2d and rasterize_gaussians_sum functions directly instead of instantiating a SimpleTrainer2d class to run the training. I make sure to move all the inputs to these functions onto a user-specified device, but with anything other than cuda:0 it was initially giving me the error above.

I assumed this could be because the CUDA code runs on cuda:0 by default (for rendering). So I added cudaSetDevice(device_id); to the bindings.cu file for these two functions and re-compiled the package. After doing this it started working, but the code ran much slower on the other CUDA devices. Inspecting nvidia-smi, I could see that when the user-specified device is cuda:1 or cuda:2, part of the script is still hosted on cuda:0. I guess the slowdown is because some data is repeatedly shuttled back and forth between devices. Will I have to add cudaSetDevice(device_id); to every custom CUDA kernel binding that is implemented?
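In the meantime, I get the same effect from the Python side without patching each binding, since the extension launches its kernels on the thread's current CUDA device. A sketch (standard PyTorch calls, not this repo's API; assumes a machine with at least three GPUs):

```python
import torch

device = torch.device("cuda:2")

# Selecting the thread's current CUDA device once means the extension's kernel
# launches target this device, without adding cudaSetDevice() to every binding.
torch.cuda.set_device(device)

# Scoped alternative: restores the previous current device on exit.
with torch.cuda.device(device):
    pass  # calls to project_gaussians_2d / rasterize_gaussians_sum go here

# To track down what still lands on cuda:0, compare per-device allocations:
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 2**20
    print(f"cuda:{i}: {mib:.1f} MiB allocated")
```

From what I understand, the idiomatic fix inside the extension itself would be a device guard (e.g. at::cuda::CUDAGuard) scoped to the input tensor's device in each binding, rather than a bare cudaSetDevice. The residual cuda:0 usage likely means some tensor in my pipeline is still created on the default device; the per-device allocation check above should help locate it.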