fangq / mcxcl

Monte Carlo eXtreme for OpenCL (MCXCL)
http://mcx.space/wiki/?Learn#mcxcl

MCX-CL freezes on Windows above a certain photon number #29

Closed fangq closed 5 years ago

fangq commented 5 years ago

Subject: [mcx-users] MCX-CL freezes on AMD Radeon R5 430 / Intel (R) HD Graphics 630 based computer when number of photons is bigger than 1.5e6
Date: Fri, 26 Jul 2019 10:07:41 -0700 (PDT)
From: Felipe Orihuela-Espina <f.orihuela.espina@gmail.com>
Reply-To: mcx-users@googlegroups.com
To: mcx-users <mcx-users@googlegroups.com>

Dear Prof. Fang,

Thank you so much for sharing MCX. Our experience in the last few years with NVIDIA/CUDA based MCX has been excellent.

Please find below details of what seems to be an issue with MCX-CL. I'm more than happy to try any suggestions that you may have. Hopefully we are not missing something very obvious (or perhaps, it would be better if we are!).

Best,
Felipe


Problem description:

MCX-CL freezes on an AMD Radeon R5 430 / Intel (R) HD Graphics 630 based computer when the number of photons exceeds 1.5e6.

Further details:

Computer configuration:

TRIED AND CHECKLIST DONE SO FAR:



** RUN FROM MATLAB COMMAND LINE



%= CODE ==========================================================

clear all

%Simulation parameters
cfg.nphoton=1e7; 
cfg.gpuid=1; 
cfg.autopilot=1; 
cfg.tstart=0;    
cfg.tend=5e-9;  
cfg.tstep=5e-10; 
%cfg.savedetflag=['dsp'];
cfg.seed = rand();

%Medium parameters
cfg.prop=[0     0 1 1;...
          0.005 1 0 1.37];
cfg.vol=uint8(ones(90,90,90));
cfg.vol(:,:,1)=0;
cfg.issrcfrom0 =1;
cfg.issaveref  =1;
cfg.ismomentum =1;

%Sources 
cfg.srcpos=[30 45 1]; 
cfg.srcdir=[0 0 1];

%Detectors
cfg.detpos=[60 45 1 1]; 

%Run simulation
[fluences,detphoton,vols,seeds]=mcxlab(cfg,'opencl');

%= OUTPUT ==========================================================

MCX-CL-freeze_FreezedRunOnMatlab.png



** RUN ON MCXSTUDIO GUI



MCX-CL-freeze_ValidationOnMCXStudioGUI.png

MCX-CL-freeze_FreezedRunOnMCXStudioGUI.png



** RUN ON CMD CONSOLE



C:\Users\fo067\OneDrive\Documentos\Research\DCS\Code\MCXStudio\MCXSuite\mcxcl\bin>mcxcl -L
Platform [0] Name AMD Accelerated Parallel Processing
============ GPU device ID 0 [1 of 1]: AMD Radeon R5 430  ============
 Device 1 of 1:         AMD Radeon R5 430
 Compute units   :      6 core(s)
 Global memory   :      2147483648 B
 Local memory    :      32768 B
 Constant memory :      65536 B
 Clock speed     :      780 MHz
 GFXIP version:         6.0
 Stream Processor:      384
 Vendor name    :       AMD
 Auto-thread    :       24576
 Auto-block     :       64
============ CPU device ID 1 [1 of 1]: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz  ============
 Device 2 of 1:         Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
 Compute units   :      8 core(s)
 Global memory   :      34237833216 B
 Local memory    :      32768 B
 Constant memory :      65536 B
 Clock speed     :      3600 MHz
 Vendor name    :       Unknown
 Auto-thread    :       512
 Auto-block     :       64
Platform [1] Name Intel(R) OpenCL
============ GPU device ID 2 [1 of 1]: Intel(R) HD Graphics 630  ============
 Device 3 of 1:         Intel(R) HD Graphics 630
 Compute units   :      24 core(s)
 Global memory   :      13695131648 B
 Local memory    :      65536 B
 Constant memory :      4294959104 B
 Clock speed     :      1150 MHz
 Vendor name    :       IntelGPU
 Auto-thread    :       1536
 Auto-block     :       64
============ CPU device ID 3 [1 of 1]: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz  ============
 Device 4 of 1:         Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
 Compute units   :      8 core(s)
 Global memory   :      34237833216 B
 Local memory    :      32768 B
 Constant memory :      131072 B
 Clock speed     :      3600 MHz
 Vendor name    :       Intel
 Auto-thread    :       512
 Auto-block     :       64
C:\Users\fo067\OneDrive\Documentos\Research\DCS\Code\MCXStudio\MCXSuite\mcxcl\bin>cd ..\example\quicktest\

C:\Users\fo067\OneDrive\Documentos\Research\DCS\Code\MCXStudio\MCXSuite\mcxcl\example\quicktest>run_qtest.bat

C:\Users\fo067\OneDrive\Documentos\Research\DCS\Code\MCXStudio\MCXSuite\mcxcl\example\quicktest>..\..\bin\mcxcl.exe -A -g 10 -n 1e7 -f qtest.inp -s qtest -r 1 -a 0 -b 0 -G 1 -D P
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
=          Copyright (c) 2010-2019 Qianqian Fang <q.fang at neu.edu>         =
=                             http://mcx.space/                              =
=                                                                            =
= Computational Optics&Translational Imaging (COTI) Lab - http://fanglab.org =
=            Department of Bioengineering, Northeastern University           =
==============================================================================
=    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365     =
==============================================================================
$Rev::969907$2019.3 $Date::2019-07-16 23:26:10 -04$ by $Author::Qianqian Fang$
==============================================================================
- variant name: [Detective MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] xoroshiro128+ [Seed Length] 4
initializing streams ...        init complete : 1 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL  -DUSE_ATOMIC -DMCX_SAVE_DETECTORS -DINTERNAL_SOURCE
Kernel build log:
"C:\Users\fo067\AppData\Local\Temp\OCL15120T1.cl", line 164: warning: OpenCL
          extension is now part of core
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
                           ^

"C:\Users\fo067\AppData\Local\Temp\OCL15120T1.cl", line 190: warning: function
          "copystate" was declared but never referenced
  static void copystate(__private RandType t[RAND_BUF_LEN], __private RandType tnew[RAND_BUF_LEN]){
              ^

"C:\Users\fo067\AppData\Local\Temp\OCL15120T1.cl", line 196: warning: function
          "rand_need_more" was declared but never referenced
  static void rand_need_more(__private RandType t[RAND_BUF_LEN]){
              ^

"C:\Users\fo067\AppData\Local\Temp\OCL15120T1.cl", line 211: warning: function
          "gpu_rng_reseed" was declared but never referenced
  static void gpu_rng_reseed(__private RandType t[RAND_BUF_LEN],__global uint *cpuseed,uint idx,float reseed){
              ^

build program complete : 348 ms
- [device 0(1): AMD Radeon R5 430] threadph=651 oddphotons=640 np=10000000.0 nthread=15360 nblock=64 repetition=1
set kernel arguments complete : 350 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ...
Progress: [>                                                                                                      ]   0%progress=0
progress=7
progress=13
Progress: [=>                                                                                                     ]   1%progress=14
progress=326
Progress: [===================================>                                                                   ]  34%progress=326
progress=326
progress=326
progress=326
progress=326
progress=326
progress=326
progress=326
Terminate batch job (Y/N)? y

C:\Users\fo067\OneDrive\Documentos\Research\DCS\Code\MCXStudio\MCXSuite\mcxcl\example\quicktest>

fangq commented 5 years ago

Hi Felipe,

From the symptom (flickering screen), I am quite certain that the kernel was killed by the graphics driver's TDR (Timeout Detection and Recovery). I assume that mcxcl had no problem running on the CPU targets (-G 2 using the AMD driver, and -G 4 using the Intel driver).

Changing the TdrDelay setting is the typical suggestion for fixing this on Windows. You mention that you tried it, but I have a feeling it was not effective. For NVIDIA GPUs on Windows, you can use an Nsight options dialog to disable TDR (or change the delay time); see

https://docs.nvidia.com/gameworks/content/developertools/desktop/timeout_detection_recovery.htm

However, I am not sure whether this method works for AMD or Intel GPUs. Maybe you want to give it a try?
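
On the TdrDelay setting mentioned above: on Windows the TDR timeout is read from the `TdrDelay` registry value (a DWORD, in seconds) under `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`, and Windows falls back to a 2-second default when the value is absent. A minimal Python sketch for checking whether a change actually landed in the registry might look like this. The helper name `read_tdr_delay` is hypothetical, and the script only reads the value; it does not modify anything.

```python
import sys

# Registry location of the Windows TDR settings (assumed unchanged on the
# machine being checked).
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_delay():
    """Return TdrDelay in seconds, or None if it is not set.

    Also returns None on non-Windows systems, where the registry
    does not exist.
    """
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard-library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except OSError:
        # Key or value missing: Windows uses its built-in default timeout.
        return None


if __name__ == "__main__":
    delay = read_tdr_delay()
    if delay is None:
        print("TdrDelay is not set (or this is not Windows)")
    else:
        print(f"TdrDelay = {delay} s")
```

If this reports that the value is not set after an attempted change, the edit probably went to a different key or was not saved; a changed value typically also needs a reboot before the driver picks it up.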

Another thing to try is to disable the progress bar (remove -D P, or uncheck "Show progress bar"). The progress bar feature is not very stable and can sometimes cause a hang (the host keeps waiting even though the kernel has completed).
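
As a rough sketch of the two command-line checks suggested in this comment (run on a CPU target, and run without the progress bar), the snippet below rebuilds the `mcxcl` arguments from the `run_qtest.bat` invocation in the reported log, swapping the device ID and leaving out `-D P`. The helper name `build_qtest_args` is hypothetical; the flags are copied from that log, and the device numbers assume the `mcxcl -L` listing from this report.

```python
def build_qtest_args(device_id, show_progress=False):
    """Build the argument list for the quicktest run.

    device_id follows the 1-based order printed by `mcxcl -L`
    (in this report: 2 = CPU via the AMD platform,
    4 = CPU via the Intel platform).
    """
    args = [
        "-A", "-g", "10", "-n", "1e7", "-f", "qtest.inp", "-s", "qtest",
        "-r", "1", "-a", "0", "-b", "0",
        "-G", str(device_id),
    ]
    if show_progress:
        # "-D P" enables the progress bar; omitted by default here.
        args += ["-D", "P"]
    return args


if __name__ == "__main__":
    # Print one command line per CPU target, without the progress bar.
    for dev in (2, 4):
        print("mcxcl.exe " + " ".join(build_qtest_args(dev)))
```

If both CPU runs finish and the GPU run without `-D P` still stalls, that points toward the driver timeout rather than the progress bar.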

nsight_monitor_general_tdr_true 002

fangq commented 5 years ago

Fixed by the user; see https://groups.google.com/d/msg/mcx-users/VWtS7OeHK4o/-Ru2xKoFEQAJ