fangq / mcx

Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator
http://mcx.space

Computer turns off when using two GPUs #228

Closed Edouard2laire closed 4 weeks ago

Edouard2laire commented 1 month ago

Hello,

I am here to report a peculiar bug. We have a computer with two GPUs (GTX 1080 Ti). If I use each of them separately with MCXLAB, it works well, but if I try to start a simulation using both GPUs, the computer turns off after a few seconds.

Do you have any idea of what could be causing the issue?

Both GPUs are powered by a Cooler Master G750M power supply: https://www.coolermaster.com/en-global/products/g750m/. From the case, it should be able to deliver 750 W, which should be enough, I think?


Thanks a lot,
Edouard

kalvdans commented 1 month ago

It does sound like a power consumption problem. Each GPU draws 250 W according to https://en.m.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units and the power supply also has to power the CPU, hard disk, and fans.

Try throttling down the GPUs using sudo nvidia-smi -pl 200
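As a rough sketch of how that limit could be applied to each card separately, the snippet below only builds and prints the commands; it does not run them. The `-i` (select GPU) and `-pl` (power limit in watts) options are standard nvidia-smi flags, but the GPU indices 0 and 1 and the helper name `power_limit_commands` are assumptions made for illustration.

```python
import shlex

def power_limit_commands(gpu_indices, watts):
    """Build one `sudo nvidia-smi -i <index> -pl <watts>` command per GPU."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    return [
        ["sudo", "nvidia-smi", "-i", str(index), "-pl", str(watts)]
        for index in gpu_indices
    ]

# Indices 0 and 1 are assumed; check `nvidia-smi -L` for the real ones.
for command in power_limit_commands([0, 1], 200):
    print(shlex.join(command))
```

The driver rejects values outside the range the card supports, and the limit is normally reset after a reboot, so it would have to be reapplied before each test run.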

Edouard2laire commented 1 month ago

> It does sound like a power consumption problem. Each GPU draws 250 W according to https://en.m.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units and the power supply also has to power the CPU, hard disk, and fans.

The two GPUs would then use 500 W, leaving 250 W for the CPU, fans ... shouldn't that be enough? If that is the issue, how much power would you recommend?
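To make that arithmetic explicit, here is a minimal steady-state estimate. The 750 W supply and 250 W per GPU come from this thread; the 150 W figure for the CPU, disks, and fans is an assumed placeholder, and `psu_headroom` is a hypothetical helper name.

```python
def psu_headroom(psu_watts, gpu_watts, n_gpus, other_watts):
    """Return (total_load, headroom) in watts for a steady-state estimate."""
    total = gpu_watts * n_gpus + other_watts
    return total, psu_watts - total

# 150 W for everything other than the GPUs is an assumption, not a measurement.
total, headroom = psu_headroom(750, 250, 2, 150)
print(f"stock limits: load {total} W, headroom {headroom} W")
# stock limits: load 650 W, headroom 100 W

# Same estimate with each GPU capped at 200 W.
total, headroom = psu_headroom(750, 200, 2, 150)
print(f"200 W limits: load {total} W, headroom {headroom} W")
# 200 W limits: load 550 W, headroom 200 W
```

This only compares rated averages. It says nothing about brief spikes above the rated draw or about a supply that no longer delivers its full rating, so a positive headroom here does not rule out a power problem.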

> Try throttling down the GPUs using sudo nvidia-smi -pl 200

Thanks. I'll try it and let you know if that works.

Edouard

fangq commented 1 month ago

@Edouard2laire, I too suspect this is a power issue rather than a bug in mcx, although I should note that I have a few boxes using 750 W power supplies to drive comparable GPUs with no issue.

I suspect the issue might be caused by degraded performance of either the power supply or your GPUs, especially if this config has been working in the past. Cleaning out dust and ensuring sufficient airflow and fan speed could be one way to verify. You should use nvtop to watch the power draw and temperature of your GPUs when running simulations on each GPU individually and see if you can capture any abnormality; you could also use the sensors command to check GPU/chassis fan speed and temperature. If the system crashes, you can look at the system logs under /var/log.
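Because the machine powers off abruptly, readings shown in an interactive tool are lost. One possible workaround, sketched below, is to poll nvidia-smi and write each sample to disk immediately. The `--query-gpu` fields `index`, `power.draw`, and `temperature.gpu` are standard nvidia-smi query fields; the log file name, the one-second interval, and the function names are assumptions for illustration.

```python
import os
import subprocess
import time

QUERY = "index,power.draw,temperature.gpu"

def parse_line(line):
    """Parse one CSV row such as '0, 187.32, 71' into (index, watts, celsius).

    Raises ValueError if the driver reports a non-numeric value for a field.
    """
    index, watts, temp = [field.strip() for field in line.split(",")]
    return int(index), float(watts), int(temp)

def read_gpus():
    """Query all GPUs once via nvidia-smi and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_line(row) for row in result.stdout.splitlines() if row.strip()]

def log_samples(path, count, interval=1.0):
    """Append `count` samples to `path`, forcing each one to disk."""
    with open(path, "a") as log:
        for _ in range(count):
            stamp = time.strftime("%H:%M:%S")
            for index, watts, temp in read_gpus():
                log.write(f"{stamp} gpu{index} {watts:.1f} W {temp} C\n")
            # Flush and fsync so the last readings survive a sudden power-off.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)
```

Calling something like `log_samples("gpu_power.log", 600)` in a second terminal before starting the simulation would leave the last recorded power draw and temperature of each GPU on disk for inspection after the reboot.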

Replacing or upgrading your power supply would be a more effective step if debugging doesn't reveal anything meaningful.

fangq commented 4 weeks ago

I am closing this ticket. If something related to mcx is identified, please feel free to reopen.