fangq / mcx

Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator
http://mcx.space

Cannot use multi-GPU properly in MCXLAB #88

Closed kaoben2731 closed 4 years ago

kaoben2731 commented 4 years ago

I have a computer with 2 GPUs, and I ran mcx_gpu_benchmarks.m on it. Setting cfg.gpuid='11' should let me use both GPUs, but the simulation result is weird: it reports 'output absorption fraction is incorrect', and I found that fewer photons are detected when both GPUs are used. Also, if I assign more of the workload to GPU1, more photons are detected. By the way, if I set cfg.gpuid='01' or '10', each GPU works properly on its own.

My OS is Windows 10, with MATLAB R2018b, MCXLAB v2019.4, and NVIDIA driver 441.87. The 1st GPU is a GTX 750 Ti and the 2nd GPU is a GTX 660 Ti.

Is there any clue as to how to solve this problem? Thanks.
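For readers following along, the two-GPU setup described above can be sketched as below. This is only an illustration: it assumes `cfg` has already been populated by the benchmark script, and the 60/40 split is an arbitrary example value, not a setting taken from this report.

```matlab
% Ask MCXLAB which CUDA devices it can see; their order defines the
% positions in the gpuid mask string.
gpus = mcxlab('gpuinfo');
fprintf('MCXLAB sees %d CUDA device(s)\n', numel(gpus));

% cfg is assumed to exist already (e.g. created by mcx_gpu_benchmarks.m).
cfg.gpuid    = '11';      % mask string: '1' enables a device, '0' skips it
cfg.workload = [60 40];   % relative photon share per enabled GPU (example values)
```

With `cfg.workload` left unset, the split between the enabled devices is left to MCXLAB.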

fangq commented 4 years ago

that script contains 3 benchmarks. which one gave you this error, and what was the absorption fraction? did you change the total photon number, or run 10^8 per benchmark as in the default setting?

try nightly build from http://mcx.space/nightly and see if there is any difference.

kaoben2731 commented 4 years ago

I got the error in the 1st benchmark. I found that the absorption fraction is correct, but flux.stat.energytot is much smaller than flux.stat.nphoton, so it gives the error. I didn't change the settings of the benchmark.

I tried the nightly build, but the error still exists. The following is the simulation result and error message. [image]

It seems that it didn't get the result from the 2nd GPU. Besides changing [cfg.gpuid=1;] to [cfg.gpuid='11';], is there anything else I should edit to enable 2 GPUs?

Thanks for your reply.
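The mismatch described here can be checked directly from the returned statistics. The following is a rough sketch, not the check used inside mcx_gpu_benchmarks.m: it assumes `flux` is the first output of an earlier `mcxlab(cfg)` call, and the 1% tolerance is a hypothetical value chosen for illustration.

```matlab
% flux is assumed to be the first output of a previous mcxlab(cfg) call.
launched  = flux.stat.energytot;   % total launched energy reported by MCXLAB
requested = flux.stat.nphoton;     % photon count reported by MCXLAB
tol = 0.01;                        % hypothetical tolerance, not from the benchmark

if abs(launched - requested) > tol * requested
    % A large gap suggests part of the photon budget never produced results.
    warning('energytot is %.1f%% of nphoton', 100 * launched / requested);
end
```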

fangq commented 4 years ago

looks like the execution on the 660Ti has failed, and returned without results. I am not sure what happened; it may be an error on the driver side.

> Besides changing [cfg.gpuid=1;] to [cfg.gpuid='11';], is there anything else I should edit to enable 2 GPUs?

no, that's all you need.
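One way to narrow down which device returns without results is to run the same small simulation under each mask and compare the statistics. The sketch below is an assumption-laden illustration: the cube domain and optical properties are a generic homogeneous test setup, not the benchmark's configuration, and the photon count is reduced so each run stays short.

```matlab
clear cfg
cfg.nphoton = 1e6;                         % small run, for a quick comparison
cfg.vol     = uint8(ones(60, 60, 60));     % homogeneous 60x60x60 voxel cube
cfg.srcpos  = [30 30 1];                   % source position (voxel units)
cfg.srcdir  = [0 0 1];                     % source points along +z
cfg.prop    = [0 0 1 1; 0.005 1 0 1.37];   % rows: [mua mus g n] for media 0 and 1
cfg.tstart  = 0;
cfg.tend    = 5e-9;
cfg.tstep   = 5e-9;
cfg.autopilot = 1;                         % let MCX choose thread/block sizes

masks = {'10', '01', '11'};                % 1st GPU only, 2nd GPU only, both
for i = 1:numel(masks)
    cfg.gpuid = masks{i};
    flux = mcxlab(cfg);
    fprintf('gpuid=%s: nphoton=%g, energytot=%g\n', ...
        masks{i}, flux.stat.nphoton, flux.stat.energytot);
end
```

If the two single-GPU runs agree with each other while the '11' run reports a noticeably lower energytot, that is consistent with one device failing only in the multi-GPU run.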

fangq commented 4 years ago

also, are you running this on a Windows machine? can you double-check your TdrDelay registry key? see this thread:

https://groups.google.com/forum/?hl=en#!topic/mcx-users/FA8E1o8o5KA
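As a convenience, the current value can be read from within MATLAB on Windows. This is a minimal sketch that assumes the key lives at its usual location under GraphicsDrivers; it only reads the value and does not change it.

```matlab
% Windows only: read the TDR timeout (in seconds) from the registry.
key = 'SYSTEM\CurrentControlSet\Control\GraphicsDrivers';
try
    tdr = winqueryreg('HKEY_LOCAL_MACHINE', key, 'TdrDelay');
    fprintf('TdrDelay = %d seconds\n', tdr);
catch
    % winqueryreg throws an error when the value does not exist.
    disp('TdrDelay is not set, so the Windows default timeout applies.');
end
```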

kaoben2731 commented 4 years ago

Yes, I run it on a Windows 10 machine, but TdrDelay had already been edited previously. [image]

Also, my screen is connected to the 1st GPU (750 Ti), and the benchmark completes on this GPU, so I don't think the watchdog time limit is causing the problem.

fangq commented 4 years ago

previously, one of my colleagues tested the TdrDelay setting, and it seems that it applies to all GPUs, whether or not they are connected to a display.

other than making sure that you do a reboot, I don't really know if there are other tricks.

another way to disable this is to install Nsight, which has a dialog for it; see

https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/timeout_detection_recovery.htm

fangq commented 4 years ago

closing, as it looks like a driver issue, not an MCX issue. Feel free to reopen if you find out otherwise.

kaoben2731 commented 4 years ago

I still cannot use these 2 GPUs properly, even though I tried using the CPU's integrated graphics for the display so that both GPUs are dedicated.

Anyway, many thanks for your help.

fangq commented 4 years ago

Again, I can't reproduce this problem on my side - I can run mcx on multiple GPUs on different platforms, dedicated or non-dedicated. Unless you can provide an example that lets one reproduce and debug the issue, there is nothing we can do. Again, my assessment is that it is related to your driver.