Closed: imeckler closed this issue 5 years ago.
Thanks for reporting the issue. The latest version compiles and runs fine for me. Are you running cuda-fixnum 0.2.1? Could you please post your build setup (e.g. OS, nvcc --version, nvidia-smi, etc.)?
Right, I have reproduced this on CC 7+ (Volta), so it is probably related to the intra-warp independent thread scheduling introduced in Volta. I can't see anything else in the changes from CC 6 to CC 7 that might cause a problem. Oddly enough, the function in which the execution stalls is __ballot_sync (which was removed in CC 7), but only in some specific cases; other cases are unaffected.
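For anyone else hitting a similar stall, here is a minimal sketch of how a warp vote can go wrong under independent thread scheduling. It is not the code from cuda-fixnum or from the fix commits; the kernel `vote_in_branch` and the host helper `run_vote` are hypothetical names made up for illustration. The CUDA programming guide requires that every non-exited thread named in the mask of `__ballot_sync` executes the same call, so naming lanes that took a different branch is undefined behaviour and may hang on CC 7+. One common pattern is to compute the participation mask while all lanes still reach the same call, then reuse it inside the branch:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp votes on "is my element odd?" from inside
// a branch that only some lanes of the last warp enter.
__global__ void vote_in_branch(const int *in, unsigned *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    bool in_range = idx < n;

    // Every lane of the warp executes this call, so the full mask is valid.
    // The result records which lanes will enter the branch below.
    unsigned active = __ballot_sync(0xffffffffu, in_range);

    if (in_range) {
        // Only the lanes in `active` get here, so name exactly those lanes.
        // Using 0xffffffff here would name lanes that never arrive, which
        // is undefined behaviour and can stall on Volta and later.
        unsigned votes = __ballot_sync(active, in[idx] & 1);
        out[idx] = votes;
    }
}

// Hypothetical host helper: runs the kernel on the integers 0..n-1 and
// returns the vote mask stored for element `probe`.
// Error checking is omitted to keep the sketch short.
unsigned run_vote(int n, int probe) {
    std::vector<int> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = i;
    std::vector<unsigned> h_out(n, 0u);

    int *d_in = nullptr;
    unsigned *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(unsigned));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 32;                        // one warp per block
    const int blocks = (n + threads - 1) / threads;
    vote_in_branch<<<blocks, threads>>>(d_in, d_out, n);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[probe];
}
```

With `n = 40`, the first warp is full and the second has only lanes 0 to 7 in range, so `run_vote(40, 0)` should return `0xAAAAAAAA` and `run_vote(40, 39)` should return `0xAA`. Whether this pattern matches the actual cause of the stall in cuda-fixnum is an assumption; it only shows the general class of bug.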
I believe this is finally fixed in 759bac3f10366d9674b723c755dcd774797e428b (including the previous two commits 1f1a432a40e6e93d199ccc35480c9fe971ac6238 and d6cf0093f510d620a17ba06b2909432777c201a6).
@jkrauska is still experiencing this issue so re-opening.
My mistake -- I was operating on my own fork and I had not yet included your changes.
It's all working.
$ bin/bench 5000000
Function: mul_lo, #elts: 5000e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32        19.1      0.000    56818181.8
    64     32        38.1      0.000    28571428.6
   128     32        76.3      0.000    13333333.3
   256     32       152.6      0.001     5434782.6
   512     32       305.2      0.003     1679543.2
  1024     32       610.4      0.008      630278.6
    64     64        38.1      0.000    30674846.6
   128     64        76.3      0.000    15974440.9
   256     64       152.6      0.001     8143322.5
   512     64       305.2      0.001     4038772.2
  1024     64       610.4      0.004     1146000.5
  2048     64      1220.7      0.017      302078.3
Function: mul_wide, #elts: 5000e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32        19.1      0.000    57471264.4
    64     32        38.1      0.000    30487804.9
   128     32        76.3      0.000    15822784.8
   256     32       152.6      0.001     7704160.2
   512     32       305.2      0.002     3021148.0
  1024     32       610.4      0.007      727167.0
    64     64        38.1      0.000    30674846.6
   128     64        76.3      0.000    15923566.9
   256     64       152.6      0.001     8169934.6
   512     64       305.2      0.001     4048583.0
  1024     64       610.4      0.004     1212121.2
  2048     64      1220.7      0.016      311003.3
Function: sqr_wide, #elts: 5000e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32        19.1      0.000    57471264.4
    64     32        38.1      0.000    30303030.3
   128     32        76.3      0.000    15873015.9
   256     32       152.6      0.001     7704160.2
   512     32       305.2      0.003     1708233.7
  1024     32       610.4      0.011      471742.6
    64     64        38.1      0.000    30674846.6
   128     64        76.3      0.000    15923566.9
   256     64       152.6      0.001     8169934.6
   512     64       305.2      0.001     4029008.9
  1024     64       610.4      0.006      897182.8
  2048     64      1220.7      0.021      236708.8
Function: modexp redc, #elts: 50e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32         0.2      0.006        8436.0
    64     32         0.4      0.011        4478.7
   128     32         0.8      0.014        3525.1
   256     32         1.5      0.037        1342.1
   512     32         3.1      0.132         378.1
  1024     32         6.1      0.368         135.9
    64     64         0.4      0.011        4416.6
   128     64         0.8      0.021        2342.4
   256     64         1.5      0.040        1238.5
   512     64         3.1      0.089         564.0
  1024     64         6.1      0.417         119.8
  2048     64        12.2      1.480          33.8
Function: modexp cios, #elts: 50e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32         0.2      0.006        7770.0
    64     32         0.4      0.009        5423.6
   128     32         0.8      0.015        3395.6
   256     32         1.5      0.019        2701.5
   512     32         3.1      0.044        1138.7
  1024     32         6.1      0.154         323.7
    64     64         0.4      0.010        5194.8
   128     64         0.8      0.014        3627.7
   256     64         1.5      0.023        2158.1
   512     64         3.1      0.043        1174.7
  1024     64         6.1      0.171         293.0
  2048     64        12.2      0.863          58.0
Function: multi modexp redc, #elts: 50e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32         0.2      0.007        7531.3
    64     32         0.4      0.015        3331.3
   128     32         0.8      0.016        3192.4
   256     32         1.5      0.040        1255.6
   512     32         3.1      0.129         387.7
  1024     32         6.1      0.373         134.1
    64     64         0.4      0.011        4494.0
   128     64         0.8      0.021        2359.0
   256     64         1.5      0.041        1223.1
   512     64         3.1      0.086         580.9
  1024     64         6.1      0.415         120.5
  2048     64        12.2      1.490          33.6
Function: multi modexp cios, #elts: 50e3
fixnum  digit  total data       time        Kops/s
  bits   bits       (MiB)  (seconds)
    32     32         0.2      0.006        8302.9
    64     32         0.4      0.010        5058.7
   128     32         0.8      0.016        3069.0
   256     32         1.5      0.019        2650.8
   512     32         3.1      0.041        1209.9
  1024     32         6.1      0.155         322.4
    64     64         0.4      0.009        5301.7
   128     64         0.8      0.013        3804.9
   256     64         1.5      0.021        2326.0
   512     64         3.1      0.043        1175.9
  1024     64         6.1      0.171         292.0
  2048     64        12.2      0.870          57.5
When running the bench program I receive the following output:
This has not completed after several minutes. While it is running, the process uses 100% CPU, so there is probably an infinite loop somewhere.