Open valeriob01 opened 4 years ago
I hope this helps. Compiler trace:
make
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -funroll-all-loops -funsafe-loop-optimizations -fira-region=all -fsched-spec-load -fsched-stalled-insns=10 -fsched-stalled-insns-dep=10 -fno-align-labels -c sieve.c -o sieve.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c timer.c -o timer.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c parse.c -o parse.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c read_config.c -o read_config.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c mfaktc.c -o mfaktc.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c checkpoint.c -o checkpoint.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c signal_handler.c -o signal_handler.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c filelocking.c -o filelocking.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c output.c -o output.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c mfakto.cpp -o mfakto.o
mfakto.cpp: In function ‘int init_CL(int, cl_int*)’:
mfakto.cpp:553:83: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
553 | commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:553:83: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
553 | commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:557:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
557 | commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:557:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
557 | commandQueue = clCreateCommandQueue(context, devices[*devnumber], props, &status);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:571:86: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
571 | commandQueuePrf = clCreateCommandQueue(context, devices[*devnumber], props, &status);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp:571:86: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
571 | commandQueuePrf = clCreateCommandQueue(context, devices[*devnumber], props, &status);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
mfakto.cpp: In function ‘int run_mod_kernel(cl_ulong, cl_ulong, cl_ulong, cl_float, cl_ulong*, cl_ulong*)’:
mfakto.cpp:1682:26: warning: ‘cl_int clEnqueueTask(cl_command_queue, cl_kernel, cl_uint, _cl_event* const*, _cl_event**)’ is deprecated [-Wdeprecated-declarations]
1682 | &mod_evt);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1378:1: note: declared here
1378 | clEnqueueTask(cl_command_queue /* command_queue */,
| ^~~~~~~~~~~~~
mfakto.cpp:1682:26: warning: ‘cl_int clEnqueueTask(cl_command_queue, cl_kernel, cl_uint, _cl_event* const*, _cl_event**)’ is deprecated [-Wdeprecated-declarations]
1682 | &mod_evt);
| ^
In file included from my_types.h:25,
from mfakto.h:23,
from mfakto.cpp:25:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1378:1: note: declared here
1378 | clEnqueueTask(cl_command_queue /* command_queue */,
| ^~~~~~~~~~~~~
mfakto.cpp: In function ‘int run_kernel15(cl_kernel, cl_uint, int75, int, cl_uint8, cl_mem, cl_int, cl_int)’:
mfakto.cpp:1750:5: note: the ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
1750 | int run_kernel15(cl_kernel l_kernel, cl_uint exp, int75 k_base, int stream, cl_uint8 b_in, cl_mem res, cl_int shiftcount, cl_int bin_max)
| ^~~~~~~~~~~~
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c gpusieve.cpp -o gpusieve.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c perftest.cpp -o perftest.o
perftest.cpp: In function ‘GPUKernels test_cpu_tf_kernels(cl_uint)’:
perftest.cpp:1021:27: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
1021 | printf("exponent=%u, %lldM FCs (sieved: %lldM FCs) each, ",
| ~~~^
| |
| long long int
| %ld
1022 | mystuff.exponent, num_fcs >> 20, ((cl_ulong)num_loops*mystuff.threads_per_grid)>>20);
| ~~~~~~~~~~~~~
| |
| cl_ulong {aka long unsigned int}
perftest.cpp:1021:46: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
1021 | printf("exponent=%u, %lldM FCs (sieved: %lldM FCs) each, ",
| ~~~^
| |
| long long int
| %ld
1022 | mystuff.exponent, num_fcs >> 20, ((cl_ulong)num_loops*mystuff.threads_per_grid)>>20);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| cl_ulong {aka long unsigned int}
perftest.cpp:1025:16: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
1025 | printf("k=%llu, %f GHz-days (assignment), %f GHz-days (per test): ", k, ghzd, ghzdt); fflush(stdout);
| ~~~^ ~
| | |
| long long unsigned int cl_ulong {aka long unsigned int}
| %lu
perftest.cpp: In function ‘GPUKernels test_gpu_tf_kernels(cl_uint)’:
perftest.cpp:1157:27: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
1157 | printf("exponent=%u, %lldM FCs each, ", mystuff.exponent, num_fcs>>20);
| ~~~^ ~~~~~~~~~~~
| | |
| long long int cl_ulong {aka long unsigned int}
| %ld
perftest.cpp:1160:16: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 2 has type ‘cl_ulong’ {aka ‘long unsigned int’} [-Wformat=]
1160 | printf("k=%llu, %f GHz-days (assignment), %f GHz-days (per test): ", k, ghzd, ghzdt); fflush(stdout);
| ~~~^ ~
| | |
| long long unsigned int cl_ulong {aka long unsigned int}
| %lu
perftest.cpp: In function ‘void CL_test(cl_int)’:
perftest.cpp:1696:82: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
1696 | commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
| ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1696:82: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
1696 | commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
| ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1700:84: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
1700 | commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
| ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1700:84: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
1700 | commandQueue = clCreateCommandQueue(context, devices[devnumber], props, &status);
| ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1711:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
1711 | commandQueuePrf = clCreateCommandQueue(context, devices[devnumber], props, &status);
| ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1711:85: warning: ‘_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)’ is deprecated [-Wdeprecated-declarations]
1711 | commandQueuePrf = clCreateCommandQueue(context, devices[devnumber], props, &status);
| ^
In file included from perftest.cpp:28:
/opt/rocm-3.3.0/opencl/include/CL/cl.h:1364:1: note: declared here
1364 | clCreateCommandQueue(cl_context /* context */,
| ^~~~~~~~~~~~~~~~~~~~
perftest.cpp:1771:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
1771 | if (mystuff.CompileOptions[0]) // if mfakto.ini defined compile options, override the default with them
| ^~
perftest.cpp:1774:5: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
1774 | printf("Compiling kernels (build options: \"%s\").", program_options);
| ^~~~~~
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c menu.cpp -o menu.o
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/rocm-3.3.0/opencl/include -DBUILD_OPENCL -c kbhit.cpp -o kbhit.o
g++ sieve.o timer.o parse.o read_config.o mfaktc.o checkpoint.o signal_handler.o filelocking.o output.o mfakto.o gpusieve.o perftest.o menu.o kbhit.o -m64 -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -L/opt/rocm-3.3.0/opencl/lib/x86_64 -lOpenCL -o ../mfakto
mfakto.cpp: In function ‘tf_class_opencl.constprop’:
mfakto.cpp:2729:35: note: the ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
2729 | status = run_gs_kernel15(kernel_info[use_kernel].kernel, numblocks, shared_mem_required, k_base, b_in, shiftcount);
| ^
mfakto.cpp: In function ‘run_gs_kernel15’:
mfakto.cpp:2140:5: note: the ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
2140 | int run_gs_kernel15(cl_kernel kernel, cl_uint numblocks, cl_uint shared_mem_required, int75 k_base, cl_uint8 b_in, cl_uint shiftcount)
| ^
read_config.c: In function ‘my_read_string’:
read_config.c:124:9: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
124 | strncpy(string, buf + idx + 1, found);
| ^
read_config.c:120:30: note: length computed here
120 | found = (unsigned int) strlen(buf + idx + 1);
|
./mfakto -d 01 --perftest 1
mfakto 0.15pre6 (64bit build)
Runtime options
Inifile mfakto.ini
Verbosity 1
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 24Ki bits
GPUSieveSize 96Mi bits
FlushInterval 0
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID none
ComputerID none
TimeStampInResults yes
VectorSize 2
GPUType AUTO
SmallExp no
UseBinfile mfakto_Kernels.elf
Select device - Get device info:
WARNING: Unknown GPU name, assuming GCN. Please post the device name "gfx900 (Advanced Micro Devices, Inc.)" to http://www.mersenneforum.org/showthread.php?t=15646 to have it added to mfakto. Set GPUType in mfakto.ini to select a GPU type yourself to avoid this warning.
OpenCL device info
name gfx900 (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 2.0 (3098.0 (HSA1.1,LC))
maximum threads per block 1024
maximum threads per grid 1073741824
number of multiprocessors 64 (4096 compute elements)
clock rate 1630MHz
Automatic parameters
threads per grid 2097152
optimizing kernels for GCN
Compiling kernels.
Perftest
Generate list of the first 1075766 primes: 106.64 ms
1. CPU-Sieve-Init (once per class, 960 times per test, avg. for 1 iterations)
Init_class(sieveprimes= 5000): 0.40 ms
Init_class(sieveprimes= 20000): 1.68 ms
Init_class(sieveprimes= 80000): 7.47 ms
Init_class(sieveprimes= 200000): 19.22 ms
Init_class(sieveprimes= 500000): 50.81 ms
Init_class(sieveprimes=1000000): 106.06 ms
2. CPU-Sieve (output rate M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.
SievePrimes: 254 396 611 945 1460 2257 3487 5389 8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
SieveSizeLimit
36 kiB 574.5 523.2 487.0 447.3 405.0 377.5 348.5 318.6 289.2 257.9 222.9 201.5 164.1 126.3 99.3 79.3 64.0 50.9 39.2 30.5
36 kiB 582.8 534.6 490.9 443.9 407.0 376.9 350.3 319.1 288.7 258.8 226.2 203.2 164.4 125.9 100.0 80.0 64.2 50.1 39.9 30.4
36 kiB 571.3 525.6 487.5 437.1 407.2 375.3 344.6 317.1 288.1 258.7 227.0 203.1 164.3 126.1 99.7 80.0 64.5 50.8 39.8 30.6
Best SieveSizeLimit for
SievePrimes: 254 396 611 945 1460 2257 3487 5389 8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
at kiB: 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36
max M/s: 582.8 534.6 490.9 447.3 407.2 377.5 350.3 319.1 289.2 258.8 227.0 203.2 164.4 126.3 100.0 80.0 64.5 50.9 39.9 30.6
Survivors: 36.41% 34.06% 32.06% 30.28% 28.69% 27.27% 26.00% 24.84% 23.78% 22.82% 21.94% 21.12% 20.36% 19.66% 19.01% 18.40% 17.83% 17.29% 16.79% 16.32%
removal rate 1017.9 1035.2 1040.3 1030.2 1012.0 1006.6 996.9 965.5 926.7 875.4 808.0 759.2 642.9 516.3 426.1 354.8 297.5 243.5 197.6 156.8
3. Memory copy to GPU (blocks of 8388608 bytes)
Standard copy, standard queue:
80 MB in 15.9 ms (5279.2 MB/s) (real)
Standard copy, profiled queue:
80 MB in 16.0 ms (5229.5 MB/s) (real)
80 MB in 16.0 ms (5236.9 MB/s) (profiled data)
8 MB in 1.5 ms (5468.6 MB/s) (profiled data, peak)
Standard copy, two queues:
80 MB in 14.4 ms (5820.6 MB/s) (real)
Reinitializing with gpu_sieving enabled.
Select device - Get device info:
OpenCL device info
name gfx900 (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 2.0 (3098.0 (HSA1.1,LC))
maximum threads per block 1024
maximum threads per grid 1073741824
number of multiprocessors 64 (4096 compute elements)
clock rate 1630MHz
Automatic parameters
threads per grid 2097152
optimizing kernels for GCN
Compiling kernels.
4. GPU sieve, 1 iterations each
GPUSievePrimes (adjusted) 52534
GPUsieve minimum exponent 646182
gpusieve_init: 29.551000 ms (CPU work)
gpusieve_init_exponent: 0.056000 ms (CalcModularInverses)
Memory access fault by GPU node-1 (Agent handle: 0x5655576e89d0) on address 0x7efcfffbf000. Reason: Unknown.
Aborted
This memory access fault happens on a variety of GPUs, notably the RX580 and Vega64.
To debug, I enabled "#define TRACE_SIEVE_KERNEL 5" in gpusieve.cl. Attached is the output, which hopefully shows the problem:
stopped with ^C
A test on an AMD RX580 shows the "reason":
# ./mfakto -d 01 --perftest 1
mfakto 0.15pre6 (64bit build)
Runtime options
Inifile mfakto.ini
Verbosity 1
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 24Ki bits
GPUSieveSize 96Mi bits
FlushInterval 0
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID selroc
ComputerID RX580-8
TimeStampInResults yes
VectorSize 2
GPUType AUTO
SmallExp no
UseBinfile mfakto_Kernels.elf
Select device - Get device info:
WARNING: Unknown GPU name, assuming GCN. Please post the device name "gfx803 (Advanced Micro Devices, Inc.)" to http://www.mersenneforum.org/showthread.php?t=15646 to have it added to mfakto. Set GPUType in mfakto.ini to select a GPU type yourself to avoid this warning.
OpenCL device info
name gfx803 (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.2 (3098.0 (HSA1.1,LC))
maximum threads per block 1024
maximum threads per grid 1073741824
number of multiprocessors 36 (2304 compute elements)
clock rate 1360MHz
Automatic parameters
threads per grid 2097152
optimizing kernels for GCN
Compiling kernels.
Perftest
Generate list of the first 1075766 primes: 135.07 ms
1. CPU-Sieve-Init (once per class, 960 times per test, avg. for 1 iterations)
Init_class(sieveprimes= 5000): 0.47 ms
Init_class(sieveprimes= 20000): 1.79 ms
Init_class(sieveprimes= 80000): 7.78 ms
Init_class(sieveprimes= 200000): 20.50 ms
Init_class(sieveprimes= 500000): 54.19 ms
Init_class(sieveprimes=1000000): 111.93 ms
2. CPU-Sieve (output rate M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.
SievePrimes: 254 396 611 945 1460 2257 3487 5389 8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
SieveSizeLimit
36 kiB 538.6 502.8 462.4 423.5 389.1 357.2 327.1 298.8 272.5 244.7 213.4 191.1 155.8 120.4 95.1 76.2 60.7 47.3 35.5 26.7
36 kiB 544.8 501.9 461.1 422.7 387.3 355.1 325.6 298.7 271.9 244.1 213.1 191.0 155.6 119.9 95.4 76.1 60.6 46.9 35.9 26.6
36 kiB 544.7 501.3 461.1 422.2 384.2 344.4 327.3 299.0 271.8 244.1 213.3 191.2 155.9 120.0 95.5 76.1 60.6 47.0 35.9 26.6
Best SieveSizeLimit for
SievePrimes: 254 396 611 945 1460 2257 3487 5389 8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
at kiB: 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36
max M/s: 544.8 502.8 462.4 423.5 389.1 357.2 327.3 299.0 272.5 244.7 213.4 191.2 155.9 120.4 95.5 76.2 60.7 47.3 35.9 26.7
Survivors: 36.41% 34.06% 32.06% 30.28% 28.69% 27.27% 26.00% 24.84% 23.78% 22.82% 21.94% 21.12% 20.36% 19.66% 19.01% 18.40% 17.83% 17.29% 16.79% 16.32%
removal rate 951.4 973.5 980.0 975.3 966.9 952.5 931.5 904.7 873.3 827.8 759.5 714.3 609.5 492.1 406.6 337.8 279.6 226.4 178.1 136.7
3. Memory copy to GPU (blocks of 8388608 bytes)
Standard copy, standard queue:
80 MB in 19.1 ms (4400.9 MB/s) (real)
Standard copy, profiled queue:
80 MB in 18.8 ms (4457.0 MB/s) (real)
80 MB in 18.8 ms (4467.0 MB/s) (profiled data)
8 MB in 1.7 ms (4821.8 MB/s) (profiled data, peak)
Standard copy, two queues:
80 MB in 16.4 ms (5117.8 MB/s) (real)
Reinitializing with gpu_sieving enabled.
Select device - Get device info:
OpenCL device info
name gfx803 (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.2 (3098.0 (HSA1.1,LC))
maximum threads per block 1024
maximum threads per grid 1073741824
number of multiprocessors 36 (2304 compute elements)
clock rate 1360MHz
Automatic parameters
threads per grid 2097152
optimizing kernels for GCN
Compiling kernels.
4. GPU sieve, 1 iterations each
GPUSievePrimes (adjusted) 52534
GPUsieve minimum exponent 646182
gpusieve_init: 49.836000 ms (CPU work)
gpusieve_init_exponent: 0.140000 ms (CalcModularInverses)
gpusieve_init_class: 0.053000 ms (CalcBitToClear)
gpusieve: 2.767000 ms (SegSieve)
Memory access fault by GPU node-4 (Agent handle: 0x561fb66a22f0) on address 0x2201675000. Reason: Page not present or supervisor privilege.
Aborted
I increased the Verbosity to 3 to see what happens:
./mfakto -d 02 --perftest 1
Runtime options
Inifile mfakto.ini
Verbosity 3
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 24Ki bits
GPUSieveSize 96Mi bits
FlushInterval 0
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID none
ComputerID none
ProgressHeader "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
ProgressFormat "%d %T | %C %p%% | %t %e | %g %s %W%%"
TimeStampInResults yes
VectorSize 2
GPUType GCN
SmallExp no
UseBinfile mfakto_Kernels.elf
Select device - Get device info - Device 2/2: gfx900 (Advanced Micro Devices, Inc.),
device version: OpenCL 2.0 , driver version: 3098.0 (HSA1.1,LC)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
Global memory:8573157376, Global memory cache: 16384, local memory: 65536, workgroup size: 256, Work dimensions: 3[1024, 1024, 1024, 0, 0] , Max clock speed:1630, compute units:64
Compiling kernels (build options: "-I. -DVECTOR_SIZE=2 -O3 -DMORE_CLASSES").
BUILD OUTPUT
warning: argument unused during compilation: '-I .'
1 warning generated.
END OF BUILD OUTPUT
Error 0 (Success): clBuildProgram
Perftest
Generate list of the first 1000000 primes: 2484.60 ms
Generate list of the first 1075766 primes for GPU sieving: 193.66 ms
1. CPU-Sieve-Init (once per class, 960 times per test, avg. for 1 iterations)
Init_class(sieveprimes= 5000): 0.78 ms
Init_class(sieveprimes= 20000): 3.00 ms
Init_class(sieveprimes= 80000): 12.49 ms
Init_class(sieveprimes= 200000): 32.72 ms
Init_class(sieveprimes= 500000): 86.46 ms
Init_class(sieveprimes=1000000): 178.87 ms
2. CPU-Sieve (output rate M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.
SievePrimes: 254 396 611 945 1460 2257 3487 5389 8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
SieveSizeLimit
36 kiB 172.3 138.2 112.8 94.2 80.8 70.3 61.9 55.1 49.5 44.6 40.3 36.8 32.1 27.9 23.9 20.5 17.6 14.7 11.8 9.5
36 kiB 172.7 138.3 112.8 94.2 80.8 70.3 61.9 55.1 49.5 44.6 40.3 36.8 32.0 27.8 24.1 20.5 17.6 14.6 12.0 9.5
36 kiB 172.8 138.2 112.9 94.2 80.8 70.2 61.9 55.2 49.5 44.6 40.3 36.8 32.1 27.8 24.1 20.5 17.6 14.6 12.0 9.5
Best SieveSizeLimit for
SievePrimes: 254 396 611 945 1460 2257 3487 5389 8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
at kiB: 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36
max M/s: 172.8 138.3 112.9 94.2 80.8 70.3 61.9 55.2 49.5 44.6 40.3 36.8 32.1 27.9 24.1 20.5 17.6 14.7 12.0 9.5
Survivors: 36.41% 34.06% 32.06% 30.28% 28.69% 27.27% 26.00% 24.84% 23.78% 22.82% 21.94% 21.12% 20.36% 19.66% 19.01% 18.40% 17.83% 17.29% 16.79% 16.32%
removal rate 301.7 267.9 239.3 217.0 200.9 187.4 176.2 166.9 158.6 151.0 143.4 137.5 125.5 113.9 102.8 91.1 81.1 70.5 59.3 48.6
3. Memory copy to GPU (blocks of 8388608 bytes)
Standard copy, standard queue:
80 MB in 0.0 ms (2267191.4 MB/s) (real)
Standard copy, profiled queue:
80 MB in 0.0 ms (5592405.3 MB/s) (real)
80 MB in 0.0 ms (30404523.4 MB/s) (profiled data)
8 MB in 0.0 ms (39945752.4 MB/s) (profiled data, peak)
Standard copy, two queues:
80 MB in 0.0 ms (3495253.3 MB/s) (real)
Reinitializing with gpu_sieving enabled.
Select device - Get device info - Device 2/2: gfx900 (Advanced Micro Devices, Inc.),
device version: OpenCL 2.0 , driver version: 3098.0 (HSA1.1,LC)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
Global memory:8573157376, Global memory cache: 16384, local memory: 65536, workgroup size: 256, Work dimensions: 3[1024, 1024, 1024, 0, 0] , Max clock speed:1630, compute units:64
Compiling kernels (build options: "-I. -DVECTOR_SIZE=2 -O3 -DMORE_CLASSES -DCL_GPU_SIEVE").
BUILD OUTPUT
warning: argument unused during compilation: '-I .'
1 warning generated.
END OF BUILD OUTPUT
Error 0 (Success): clBuildProgram
4. GPU sieve, 1 iterations each
gpusieve_init: 6.518000 ms (CPU work)
gpusieve_init_exponent: 0.580500 ms (CalcModularInverses)
Memory access fault by GPU node-2 (Agent handle: 0x5655576e6e10) on address 0x7efcfc680000. Reason: Page not present or supervisor privilege.
Aborted
@valeriob01 Can this issue be closed? Is it still reproducible?