Open maolun opened 6 years ago
Hello,
I changed line 117 of the file OclHost.cpp from "if (ciErrNum == -18)" to "if (ciErrNum == -30)" and then compiled the sources using cmake. Now NGM works.
Mao-Lun
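For reference, -18 is CL_DEVICE_PARTITION_FAILED and -30 is CL_INVALID_VALUE in the OpenCL headers. A minimal sketch of a check that accepts both codes, instead of swapping one for the other, might look like the following. The helper name fallBackToSingleDevice and the constant names are hypothetical and not part of NGM.

```cpp
// Error codes as defined in the OpenCL headers (CL/cl.h).
constexpr int kClDevicePartitionFailed = -18; // CL_DEVICE_PARTITION_FAILED
constexpr int kClInvalidValue = -30;          // CL_INVALID_VALUE

// Hypothetical helper, not part of NGM: returns true when a failed
// clCreateSubDevices call should fall back to using the whole device
// (ciDeviceCount = 1), as the existing -18 branch in OclHost.cpp does.
bool fallBackToSingleDevice(int ciErrNum) {
    return ciErrNum == kClDevicePartitionFailed ||
           ciErrNum == kClInvalidValue;
}
```

With this, the condition on line 117 would read "if (fallBackToSingleDevice(ciErrNum))". Accepting -30 this way only works around the failure; it does not explain why clCreateSubDevices rejected the arguments.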
Thanks Fritz
Hi @maolun, hi @fritzsedlazeck,
I had the very same problem on this machine: https://www3.risc.jku.at/projects/mach2/. As far as I can tell the problem occurs as soon as a computer has more than 256 cores/hardware threads. MACH2 has more than 1700 cores and the Knights Landing (KNL) compute nodes of Stampede 2 have 272 hardware threads (if I read the documentation correctly).
My solution for the problem was this change to the code:
--- lib/mason/opencl/OclHost.cpp.orig 2019-05-14 16:33:09.313712490 +0200
+++ lib/mason/opencl/OclHost.cpp 2019-05-14 16:30:00.601698181 +0200
@@ -111,8 +111,8 @@
 props[1] = 1; // 4 compute units per sub-device
 props[2] = 0;
- devices = (cl_device_id *) malloc(256 * sizeof(cl_device_id));
- ciErrNum = clCreateSubDevices(device_id, props, 256, devices,
+ devices = (cl_device_id *) malloc(2560 * sizeof(cl_device_id));
+ ciErrNum = clCreateSubDevices(device_id, props, 2560, devices,
     &ciDeviceCount);
 if (ciErrNum == -18) {
     ciDeviceCount = 1;
This works for me but will fail as soon as there is a machine with more than 2560 cores (per node).
A better solution might be to first find the core count (maybe like this: https://stackoverflow.com/questions/150355/programmatically-find-the-number-of-cores-on-a-machine) and use this number in the malloc and the clCreateSubDevices calls.
What do you think?
BTW: that comment "4 compute units per sub-device" you see above is most probably wrong, isn't it?
Greetings Hermann
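The idea of sizing the buffer from the core count could be sketched roughly as below, using std::thread::hardware_concurrency from the C++ standard library. This is only an illustration under assumptions: the helper names capacityFor and subDeviceCapacity are hypothetical, the floor of 256 simply preserves the old constant, and it assumes the number of hardware threads reported by the OS is at least the number of sub-devices OpenCL wants to create. Querying CL_DEVICE_MAX_COMPUTE_UNITS through clGetDeviceInfo might be a more direct source for the same number.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical helper, not part of NGM: number of cl_device_id slots to
// allocate for a given hardware thread count. Never returns less than
// `minimum`, so a count of 0 (unknown) falls back to the old value of 256.
std::size_t capacityFor(unsigned int hwThreads, std::size_t minimum = 256) {
    return std::max<std::size_t>(hwThreads, minimum);
}

// Capacity for the machine the program is running on.
// hardware_concurrency() may return 0 if the value cannot be determined.
std::size_t subDeviceCapacity() {
    return capacityFor(std::thread::hardware_concurrency());
}

// Intended use in OclHost.cpp (assumption, shown as a comment because it
// needs the OpenCL headers):
//
//   std::size_t n = subDeviceCapacity();
//   devices = (cl_device_id *) malloc(n * sizeof(cl_device_id));
//   ciErrNum = clCreateSubDevices(device_id, props, (cl_uint) n, devices,
//                                 &ciDeviceCount);
```

For the KNL nodes mentioned above (272 hardware threads) capacityFor gives 272 slots, and for a machine with fewer than 256 threads it keeps the original 256.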
Hi Hermann, thanks for digging in. I think that comment was left over from the GPU code.... Thanks for looking at this. Fritz
Hello,
I was trying NGM at the Texas Advanced Computing Center (https://portal.tacc.utexas.edu/user-guides/stampede2). However, the following error occurs every time. I compiled NGM with CMake. I wonder if anyone has insight on how to solve this issue. Thanks. Any suggestion is greatly appreciated.
[OPENCL] Available platforms: 1
[OPENCL] AMD Accelerated Parallel Processing
[OPENCL] Selecting OpenCl platform: AMD Accelerated Parallel Processing
[OPENCL] Platform: OpenCL 1.2 AMD-APP (1214.3)
[OPENCL] 1 CPU device found.
[OPENCL] Device 0: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz (Driver: 1214.3 (sse2,avx))
[OPENCL] Couldn't create sub-devices. Error: Error: Invalid value (-30)
terminate called without an active exception
Best, Mao-Lun