Bdot42 / mfakto

Mersenne number trial factoring using OpenCL, primarily for GIMPS: Great Internet Mersenne Prime Search
http://mersennewiki.org/index.php/Mfakto
GNU General Public License v3.0

Memory access fault by GPU #18

Open valeriob01 opened 4 years ago

valeriob01 commented 4 years ago

./mfakto -d 01 --perftest 1
mfakto 0.15pre6 (64bit build)

Runtime options
  Inifile                   mfakto.ini
  Verbosity                 1
  SieveOnGPU                yes
  MoreClasses               yes
  GPUSievePrimes            81157
  GPUSieveProcessSize       24Ki bits
  GPUSieveSize              96Mi bits
  FlushInterval             0
  WorkFile                  worktodo.txt
  ResultsFile               results.txt
  Checkpoints               enabled
  CheckpointDelay           300s
  Stages                    enabled
  StopAfterFactor           class
  PrintMode                 compact
  V5UserID                  none
  ComputerID                none
  TimeStampInResults        yes
  VectorSize                2
  GPUType                   AUTO
  SmallExp                  no
  UseBinfile                mfakto_Kernels.elf
Select device - Get device info:
WARNING: Unknown GPU name, assuming GCN. Please post the device name "gfx900 (Advanced Micro Devices, Inc.)" to http://www.mersenneforum.org/showthread.php?t=15646 to have it added to mfakto. Set GPUType in mfakto.ini to select a GPU type yourself to avoid this warning.

OpenCL device info
  name                      gfx900 (Advanced Micro Devices, Inc.)
  device (driver) version   OpenCL 2.0 (3098.0 (HSA1.1,LC))
  maximum threads per block 1024
  maximum threads per grid  1073741824
  number of multiprocessors 64 (4096 compute elements)
  clock rate                1630MHz

Automatic parameters
  threads per grid          2097152
  optimizing kernels for    GCN

Compiling kernels.

Perftest

Generate list of the first 1075766 primes: 106.64 ms

1. CPU-Sieve-Init (once per class, 960 times per test, avg. for 1 iterations)
    Init_class(sieveprimes=   5000):     0.40 ms
    Init_class(sieveprimes=  20000):     1.68 ms
    Init_class(sieveprimes=  80000):     7.47 ms
    Init_class(sieveprimes= 200000):    19.22 ms
    Init_class(sieveprimes= 500000):    50.81 ms
    Init_class(sieveprimes=1000000):   106.06 ms

2. CPU-Sieve (output rate M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.

SievePrimes:     254     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
SieveSizeLimit
    36 kiB     574.5   523.2   487.0   447.3   405.0   377.5   348.5   318.6   289.2   257.9   222.9   201.5   164.1   126.3    99.3    79.3    64.0    50.9    39.2    30.5
    36 kiB     582.8   534.6   490.9   443.9   407.0   376.9   350.3   319.1   288.7   258.8   226.2   203.2   164.4   125.9   100.0    80.0    64.2    50.1    39.9    30.4
    36 kiB     571.3   525.6   487.5   437.1   407.2   375.3   344.6   317.1   288.1   258.7   227.0   203.1   164.3   126.1    99.7    80.0    64.5    50.8    39.8    30.6

Best SieveSizeLimit for
SievePrimes:     254     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
at kiB:           36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36
max M/s:       582.8   534.6   490.9   447.3   407.2   377.5   350.3   319.1   289.2   258.8   227.0   203.2   164.4   126.3   100.0    80.0    64.5    50.9    39.9    30.6
Survivors:    36.41%  34.06%  32.06%  30.28%  28.69%  27.27%  26.00%  24.84%  23.78%  22.82%  21.94%  21.12%  20.36%  19.66%  19.01%  18.40%  17.83%  17.29%  16.79%  16.32%
removal rate  1017.9  1035.2  1040.3  1030.2  1012.0  1006.6   996.9   965.5   926.7   875.4   808.0   759.2   642.9   516.3   426.1   354.8   297.5   243.5   197.6   156.8

3. Memory copy to GPU (blocks of 8388608 bytes)

  Standard copy, standard queue:
      80 MB in   15.9 ms (5279.2 MB/s) (real)

  Standard copy, profiled queue:
      80 MB in   16.0 ms (5229.5 MB/s) (real)
      80 MB in   16.0 ms (5236.9 MB/s) (profiled data)
       8 MB in    1.5 ms (5468.6 MB/s) (profiled data, peak)

  Standard copy, two queues:
      80 MB in   14.4 ms (5820.6 MB/s) (real)

Reinitializing with gpu_sieving enabled.
Select device - Get device info:

OpenCL device info
  name                      gfx900 (Advanced Micro Devices, Inc.)
  device (driver) version   OpenCL 2.0 (3098.0 (HSA1.1,LC))
  maximum threads per block 1024
  maximum threads per grid  1073741824
  number of multiprocessors 64 (4096 compute elements)
  clock rate                1630MHz

Automatic parameters
  threads per grid          2097152
  optimizing kernels for    GCN

Compiling kernels.

4. GPU sieve, 1 iterations each
  GPUSievePrimes (adjusted) 52534
  GPUsieve minimum exponent 646182

 gpusieve_init: 29.551000 ms (CPU work)
 gpusieve_init_exponent: 0.056000 ms (CalcModularInverses)
Memory access fault by GPU node-1 (Agent handle: 0x5655576e89d0) on address 0x7efcfffbf000. Reason: Unknown.
Aborted

valeriob01 commented 4 years ago

I hope this helps. Here is the compiler trace:

make
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -funroll-all-loops -funsafe-loop-optimizations -fira-region=all -fsched-spec-load -fsched-stalled-insns=10 -fsched-stalled-insns-dep=10 -fno-align-labels -c sieve.c -o sieve.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c timer.c -o timer.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c parse.c -o parse.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c read_config.c -o read_config.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c mfaktc.c -o mfaktc.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c checkpoint.c -o checkpoint.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c signal_handler.c -o signal_handler.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c filelocking.c -o filelocking.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c output.c -o output.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL  -c mfakto.cpp -o mfakto.o
mfakto.cpp: In function ‘int init_CL(int, cl_int*)’:
mfakto.cpp:553:83: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
  553 |   commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
      |                                                                                   ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:553:83: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
  553 |   commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
      |                                                                                   ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:557:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
  557 |     commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
      |                                                                                     ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:557:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
  557 |     commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
      |                                                                                     ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:571:86: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
  571 |   commandQueuePrf = clCreateCommandQueue(context, devices[*devnumber], props, &status);
      |                                                                                      ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:571:86: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
  571 |   commandQueuePrf = clCreateCommandQueue(context, devices[*devnumber], props, &status);
      |                                                                                      ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp: In function ‘int run_mod_kernel(cl_ulong, cl_ulong, cl_ulong, cl_float, cl_ulong*, cl_ulong*)’:
mfakto.cpp:1682:26: warning: ‘cl_int clEnqueueTask(cl_command_queue, cl_kernel, cl_uint, _cl_event* const*, _cl_event**)’ is deprecated [-Wdeprecated-declarations]
 1682 |                  &mod_evt);
      |                          ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1378:1: note: declared here
 1378 | clEnqueueTask(cl_command_queue  /* command_queue */,
      | ^~~~~~~~~~~~~
mfakto.cpp:1682:26: warning: ‘cl_int clEnqueueTask(cl_command_queue, cl_kernel, cl_uint, _cl_event* const*, _cl_event**)’ is deprecated [-Wdeprecated-declarations]
 1682 |                  &mod_evt);
      |                          ^
In file included from my_types.h:25,
                 from mfakto.h:23,
                 from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1378:1: note: declared here
 1378 | clEnqueueTask(cl_command_queue  /* command_queue */,
      | ^~~~~~~~~~~~~
mfakto.cpp: In function ‘int run_kernel15(cl_kernel, cl_uint, int75, int, cl_uint8, cl_mem, cl_int, cl_int)’:
mfakto.cpp:1750:5: note: the ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 1750 | int run_kernel15(cl_kernel l_kernel, cl_uint exp, int75 k_base, int stream, cl_uint8 b_in, cl_mem res, cl_int shiftcount, cl_int bin_max)
      |     ^~~~~~~~~~~~
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL  -c gpusieve.cpp -o gpusieve.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL  -c perftest.cpp -o perftest.o
perftest.cpp: In function ‘GPUKernels test_cpu_tf_kernels(cl_uint)’:
perftest.cpp:1021:27: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
 1021 |   printf("exponent=%u, %lldM FCs (sieved: %lldM FCs) each, ",
      |                        ~~~^
      |                           |
      |                           long long int
      |                        %ld
 1022 |     mystuff.exponent, num_fcs >> 20, ((cl_ulong)num_loops*mystuff.threads_per_grid)>>20);
      |                       ~~~~~~~~~~~~~
      |                               |
      |                               cl_ulong {aka long unsigned int}
perftest.cpp:1021:46: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
 1021 |   printf("exponent=%u, %lldM FCs (sieved: %lldM FCs) each, ",
      |                                           ~~~^
      |                                              |
      |                                              long long int
      |                                           %ld
 1022 |     mystuff.exponent, num_fcs >> 20, ((cl_ulong)num_loops*mystuff.threads_per_grid)>>20);
      |                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                                    |
      |                                                                                    cl_ulong {aka long unsigned int}
perftest.cpp:1025:16: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
 1025 |   printf("k=%llu, %f GHz-days (assignment), %f GHz-days (per test): ", k, ghzd, ghzdt); fflush(stdout);
      |             ~~~^                                                       ~
      |                |                                                       |
      |                long long unsigned int                                  cl_ulong {aka long unsigned int}
      |             %lu
perftest.cpp: In function ‘GPUKernels test_gpu_tf_kernels(cl_uint)’:
perftest.cpp:1157:27: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
 1157 |   printf("exponent=%u, %lldM FCs each, ", mystuff.exponent, num_fcs>>20);
      |                        ~~~^                                 ~~~~~~~~~~~
      |                           |                                        |
      |                           long long int                            cl_ulong {aka long unsigned int}
      |                        %ld
perftest.cpp:1160:16: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
 1160 |   printf("k=%llu, %f GHz-days (assignment), %f GHz-days (per test): ", k, ghzd, ghzdt); fflush(stdout);
      |             ~~~^                                                       ~
      |                |                                                       |
      |                long long unsigned int                                  cl_ulong {aka long unsigned int}
      |             %lu
perftest.cpp: In function ‘void CL_test(cl_int)’:
perftest.cpp:1696:82: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
 1696 |   commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
      |                                                                                  ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1696:82: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
 1696 |   commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
      |                                                                                  ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1700:84: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
 1700 |     commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
      |                                                                                    ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1700:84: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
 1700 |     commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
      |                                                                                    ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1711:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
 1711 |   commandQueuePrf = clCreateCommandQueue(context, devices[devnumber], props, &status);
      |                                                                                     ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1711:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
 1711 |   commandQueuePrf = clCreateCommandQueue(context, devices[devnumber], props, &status);
      |                                                                                     ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
 1364 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1771:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
 1771 |   if (mystuff.CompileOptions[0])  // if mfakto.ini defined compile options, override the default with them
      |   ^~
perftest.cpp:1774:5: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
 1774 |     printf("Compiling kernels (build options: \"%s\").", program_options);
      |     ^~~~~~
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL  -c menu.cpp -o menu.o
gcc -m64 -Wall -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL  -c kbhit.cpp -o kbhit.o
g++ sieve.o timer.o parse.o read_config.o mfaktc.o checkpoint.o signal_handler.o filelocking.o output.o mfakto.o gpusieve.o perftest.o menu.o kbhit.o -m64  -O3 -funroll-loops  -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -L/opt/rocm-3.3.0/opencl/lib/x86_64 -lOpenCL -o ../mfakto
mfakto.cpp: In function ‘tf_class_opencl.constprop’:
mfakto.cpp:2729:35: note: the ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 2729 |           status = run_gs_kernel15(kernel_info[use_kernel].kernel, numblocks, shared_mem_required, k_base, b_in, shiftcount);
      |                                   ^
mfakto.cpp: In function ‘run_gs_kernel15’:
mfakto.cpp:2140:5: note: the ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 2140 | int run_gs_kernel15(cl_kernel kernel, cl_uint numblocks, cl_uint shared_mem_required, int75 k_base, cl_uint8 b_in, cl_uint shiftcount)
      |     ^
read_config.c: In function ‘my_read_string’:
read_config.c:124:9: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
  124 |         strncpy(string, buf + idx + 1, found);
      |         ^
read_config.c:120:30: note: length computed here
  120 |       found = (unsigned int) strlen(buf + idx + 1);
      |                   
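
Side note on two of the warnings in the trace above (the -Wformat= ones in perftest.cpp and the -Wstringop-overflow= one in read_config.c): they are probably unrelated to the GPU fault, but both have a common portable fix. The sketch below is not taken from mfakto. The helper names copy_bounded and format_u64 are hypothetical, and uint64_t stands in for cl_ulong on the assumption that cl_ulong is a 64-bit unsigned integer, which is what the warning text suggests on this platform.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of mfakto: copy src into dst with the bound
   taken from the destination size instead of strlen(src), which is the
   pattern -Wstringop-overflow complains about. Always NUL-terminates.
   Returns the number of characters copied. */
static size_t copy_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst_size == 0)
        return 0;
    len = strlen(src);
    if (len >= dst_size)
        len = dst_size - 1;          /* truncate so the result fits */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}

/* Hypothetical helper: print a 64-bit unsigned value with a format macro
   that matches the type on every platform, instead of a fixed "%llu". */
static int format_u64(char *out, size_t out_size, uint64_t value)
{
    return snprintf(out, out_size, "k=%" PRIu64, value);
}
```

Calling copy_bounded(buf, sizeof buf, text) truncates instead of overrunning buf, and format_u64 avoids the long / long long mismatch that gcc reports. An explicit cast to unsigned long long together with "%llu" would be an equally valid way to silence the format warning.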
valeriob01 commented 4 years ago

This memory access fault happens on a variety of GPUs, notably the RX580 and Vega64.

valeriob01 commented 4 years ago

To debug, I enabled "#define TRACE_SIEVE_KERNEL 5" in gpusieve.cl. The attached output, from a run that I stopped with ^C, hopefully shows the problem:

mfakto_debug.txt

valeriob01 commented 4 years ago

A test on an AMD RX580 reports a "Reason" for the fault:

# ./mfakto -d 01 --perftest 1
mfakto 0.15pre6 (64bit build)

Runtime options
  Inifile                   mfakto.ini
  Verbosity                 1
  SieveOnGPU                yes
  MoreClasses               yes
  GPUSievePrimes            81157
  GPUSieveProcessSize       24Ki bits
  GPUSieveSize              96Mi bits
  FlushInterval             0
  WorkFile                  worktodo.txt
  ResultsFile               results.txt
  Checkpoints               enabled
  CheckpointDelay           300s
  Stages                    enabled
  StopAfterFactor           class
  PrintMode                 compact
  V5UserID                  selroc
  ComputerID                RX580-8
  TimeStampInResults        yes
  VectorSize                2
  GPUType                   AUTO
  SmallExp                  no
  UseBinfile                mfakto_Kernels.elf
Select device - Get device info:
WARNING: Unknown GPU name, assuming GCN. Please post the device name "gfx803 (Advanced Micro Devices, Inc.)" to http://www.mersenneforum.org/showthread.php?t=15646 to have it added to mfakto. Set GPUType in mfakto.ini to select a GPU type yourself to avoid this warning.

OpenCL device info
  name                      gfx803 (Advanced Micro Devices, Inc.)
  device (driver) version   OpenCL 1.2  (3098.0 (HSA1.1,LC))
  maximum threads per block 1024
  maximum threads per grid  1073741824
  number of multiprocessors 36 (2304 compute elements)
  clock rate                1360MHz

Automatic parameters
  threads per grid          2097152
  optimizing kernels for    GCN

Compiling kernels.

Perftest

Generate list of the first 1075766 primes: 135.07 ms

1. CPU-Sieve-Init (once per class, 960 times per test, avg. for 1 iterations)
    Init_class(sieveprimes=   5000):     0.47 ms
    Init_class(sieveprimes=  20000):     1.79 ms
    Init_class(sieveprimes=  80000):     7.78 ms
    Init_class(sieveprimes= 200000):    20.50 ms
    Init_class(sieveprimes= 500000):    54.19 ms
    Init_class(sieveprimes=1000000):   111.93 ms

2. CPU-Sieve (output rate M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.

SievePrimes:     254     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
SieveSizeLimit
    36 kiB     538.6   502.8   462.4   423.5   389.1   357.2   327.1   298.8   272.5   244.7   213.4   191.1   155.8   120.4    95.1    76.2    60.7    47.3    35.5    26.7
    36 kiB     544.8   501.9   461.1   422.7   387.3   355.1   325.6   298.7   271.9   244.1   213.1   191.0   155.6   119.9    95.4    76.1    60.6    46.9    35.9    26.6
    36 kiB     544.7   501.3   461.1   422.2   384.2   344.4   327.3   299.0   271.8   244.1   213.3   191.2   155.9   120.0    95.5    76.1    60.6    47.0    35.9    26.6

Best SieveSizeLimit for
SievePrimes:     254     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
at kiB:           36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36
max M/s:       544.8   502.8   462.4   423.5   389.1   357.2   327.3   299.0   272.5   244.7   213.4   191.2   155.9   120.4    95.5    76.2    60.7    47.3    35.9    26.7
Survivors:    36.41%  34.06%  32.06%  30.28%  28.69%  27.27%  26.00%  24.84%  23.78%  22.82%  21.94%  21.12%  20.36%  19.66%  19.01%  18.40%  17.83%  17.29%  16.79%  16.32%
removal rate   951.4   973.5   980.0   975.3   966.9   952.5   931.5   904.7   873.3   827.8   759.5   714.3   609.5   492.1   406.6   337.8   279.6   226.4   178.1   136.7

3. Memory copy to GPU (blocks of 8388608 bytes)

  Standard copy, standard queue:
      80 MB in   19.1 ms (4400.9 MB/s) (real)

  Standard copy, profiled queue:
      80 MB in   18.8 ms (4457.0 MB/s) (real)
      80 MB in   18.8 ms (4467.0 MB/s) (profiled data)
       8 MB in    1.7 ms (4821.8 MB/s) (profiled data, peak)

  Standard copy, two queues:
      80 MB in   16.4 ms (5117.8 MB/s) (real)

Reinitializing with gpu_sieving enabled.
Select device - Get device info:

OpenCL device info
  name                      gfx803 (Advanced Micro Devices, Inc.)
  device (driver) version   OpenCL 1.2  (3098.0 (HSA1.1,LC))
  maximum threads per block 1024
  maximum threads per grid  1073741824
  number of multiprocessors 36 (2304 compute elements)
  clock rate                1360MHz

Automatic parameters
  threads per grid          2097152
  optimizing kernels for    GCN

Compiling kernels.

4. GPU sieve, 1 iterations each
  GPUSievePrimes (adjusted) 52534
  GPUsieve minimum exponent 646182

 gpusieve_init: 49.836000 ms (CPU work)
 gpusieve_init_exponent: 0.140000 ms (CalcModularInverses)
 gpusieve_init_class: 0.053000 ms (CalcBitToClear)
 gpusieve: 2.767000 ms (SegSieve)
Memory access fault by GPU node-4 (Agent handle: 0x561fb66a22f0) on address 0x2201675000. Reason: Page not present or supervisor privilege.
Aborted
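
For anyone reading the CPU-Sieve table in the log above: the "removal rate" row appears to follow from the "max M/s" and "Survivors" rows. This relationship is inferred from the printed numbers, not taken from the mfakto source, and the function name removal_rate is hypothetical. If the output rate is the rate of surviving candidates, then dividing it by the surviving fraction gives the input rate, and the difference is what the sieve removed.

```c
/* Hypothetical helper, inferred from the perftest numbers only.
   output_rate:   surviving candidates per second (M/s), the "max M/s" row
   survivors_pct: surviving share in percent, the "Survivors" row
   Returns the candidates removed per second (M/s). */
static double removal_rate(double output_rate, double survivors_pct)
{
    double s = survivors_pct / 100.0;
    double input_rate = output_rate / s;   /* candidates entering the sieve */

    return input_rate - output_rate;       /* candidates sieved out */
}
```

With the first RX580 column, removal_rate(544.8, 36.41) gives about 951.5, which agrees with the printed 951.4 to within the rounding of the inputs.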
valeriob01 commented 4 years ago

I increased Verbosity to 3 to see what happens:


./mfakto -d 02 --perftest 1

Runtime options
  Inifile                   mfakto.ini
  Verbosity                 3
  SieveOnGPU                yes
  MoreClasses               yes
  GPUSievePrimes            81157
  GPUSieveProcessSize       24Ki bits
  GPUSieveSize              96Mi bits
  FlushInterval             0
  WorkFile                  worktodo.txt
  ResultsFile               results.txt
  Checkpoints               enabled
  CheckpointDelay           300s
  Stages                    enabled
  StopAfterFactor           class
  PrintMode                 compact
  V5UserID                  none
  ComputerID                none
  ProgressHeader            "Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait"
  ProgressFormat            "%d %T | %C %p%% | %t  %e |   %g  %s  %W%%"
  TimeStampInResults        yes
  VectorSize                2
  GPUType                   GCN
  SmallExp                  no
  UseBinfile                mfakto_Kernels.elf
Select device - Get device info - Device 2/2: gfx900 (Advanced Micro Devices, Inc.),
device version: OpenCL 2.0 , driver version: 3098.0 (HSA1.1,LC)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 
Global memory:8573157376, Global memory cache: 16384, local memory: 65536, workgroup size: 256, Work dimensions: 3[1024, 1024, 1024, 0, 0] , Max clock speed:1630, compute units:64
Compiling kernels (build options: "-I. -DVECTOR_SIZE=2 -O3 -DMORE_CLASSES"). 
    BUILD OUTPUT
warning: argument unused during compilation: '-I .'
1 warning generated.

    END OF BUILD OUTPUT
Error 0 (Success): clBuildProgram

Perftest

Generate list of the first 1000000 primes: 2484.60 ms

Generate list of the first 1075766 primes for GPU sieving: 193.66 ms

1. CPU-Sieve-Init (once per class, 960 times per test, avg. for 1 iterations)
    Init_class(sieveprimes=   5000):     0.78 ms
    Init_class(sieveprimes=  20000):     3.00 ms
    Init_class(sieveprimes=  80000):    12.49 ms
    Init_class(sieveprimes= 200000):    32.72 ms
    Init_class(sieveprimes= 500000):    86.46 ms
    Init_class(sieveprimes=1000000):   178.87 ms

2. CPU-Sieve (output rate M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.

SievePrimes:     254     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
SieveSizeLimit
    36 kiB     172.3   138.2   112.8    94.2    80.8    70.3    61.9    55.1    49.5    44.6    40.3    36.8    32.1    27.9    23.9    20.5    17.6    14.7    11.8     9.5
    36 kiB     172.7   138.3   112.8    94.2    80.8    70.3    61.9    55.1    49.5    44.6    40.3    36.8    32.0    27.8    24.1    20.5    17.6    14.6    12.0     9.5
    36 kiB     172.8   138.2   112.9    94.2    80.8    70.2    61.9    55.2    49.5    44.6    40.3    36.8    32.1    27.8    24.1    20.5    17.6    14.6    12.0     9.5

Best SieveSizeLimit for
SievePrimes:     254     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
at kiB:           36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36      36
max M/s:       172.8   138.3   112.9    94.2    80.8    70.3    61.9    55.2    49.5    44.6    40.3    36.8    32.1    27.9    24.1    20.5    17.6    14.7    12.0     9.5
Survivors:    36.41%  34.06%  32.06%  30.28%  28.69%  27.27%  26.00%  24.84%  23.78%  22.82%  21.94%  21.12%  20.36%  19.66%  19.01%  18.40%  17.83%  17.29%  16.79%  16.32%
removal rate   301.7   267.9   239.3   217.0   200.9   187.4   176.2   166.9   158.6   151.0   143.4   137.5   125.5   113.9   102.8    91.1    81.1    70.5    59.3    48.6

3. Memory copy to GPU (blocks of 8388608 bytes)

  Standard copy, standard queue:
      80 MB in    0.0 ms (2267191.4 MB/s) (real)

  Standard copy, profiled queue:
      80 MB in    0.0 ms (5592405.3 MB/s) (real)
      80 MB in    0.0 ms (30404523.4 MB/s) (profiled data)
       8 MB in    0.0 ms (39945752.4 MB/s) (profiled data, peak)

  Standard copy, two queues:
      80 MB in    0.0 ms (3495253.3 MB/s) (real)

Reinitializing with gpu_sieving enabled.
Select device - Get device info - Device 2/2: gfx900 (Advanced Micro Devices, Inc.),
device version: OpenCL 2.0 , driver version: 3098.0 (HSA1.1,LC)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 
Global memory:8573157376, Global memory cache: 16384, local memory: 65536, workgroup size: 256, Work dimensions: 3[1024, 1024, 1024, 0, 0] , Max clock speed:1630, compute units:64
Compiling kernels (build options: "-I. -DVECTOR_SIZE=2 -O3 -DMORE_CLASSES -DCL_GPU_SIEVE"). 
    BUILD OUTPUT
warning: argument unused during compilation: '-I .'
1 warning generated.

    END OF BUILD OUTPUT
Error 0 (Success): clBuildProgram

4. GPU sieve, 1 iterations each

 gpusieve_init: 6.518000 ms (CPU work)
 gpusieve_init_exponent: 0.580500 ms (CalcModularInverses)
Memory access fault by GPU node-2 (Agent handle: 0x5655576e6e10) on address 0x7efcfc680000. Reason: Page not present or supervisor privilege.
Aborted

valeriob01 commented 4 years ago

Possibly related ROCm issue: https://github.com/RadeonOpenCompute/ROCm/issues/1103

proski commented 4 days ago

@valeriob01 Can this issue be closed? Is it still reproducible?