data61 / cuda-fixnum

Extended-precision modular arithmetic library that targets CUDA.

Exponentiation benchmark pegs CPU, hangs forever #69

Closed imeckler closed 5 years ago

imeckler commented 5 years ago

When running the bench program I receive the following output:

$ ./bench/bench 10000   
Function: mul_lo, #elts: 10e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.0     0.000      833333.3
   64    32       0.1     0.000      833333.3
  128    32       0.2     0.000      909090.9
  256    32       0.3     0.000      769230.8
  512    32       0.6     0.000      526315.8
 1024    32       1.2     0.000      263157.9

   64    64       0.1     0.000      833333.3
  128    64       0.2     0.000      588235.3
  256    64       0.3     0.000      769230.8
  512    64       0.6     0.000      625000.0
 1024    64       1.2     0.000      370370.4
 2048    64       2.4     0.000      149253.7

Function: mul_wide, #elts: 10e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.0     0.000     1000000.0
   64    32       0.1     0.000      909090.9
  128    32       0.2     0.000      909090.9
  256    32       0.3     0.000      769230.8
  512    32       0.6     0.000      555555.6
 1024    32       1.2     0.000      294117.6

   64    64       0.1     0.000      833333.3
  128    64       0.2     0.000      833333.3
  256    64       0.3     0.000      769230.8
  512    64       0.6     0.000      625000.0
 1024    64       1.2     0.000      384615.4
 2048    64       2.4     0.000      151515.2

Function: sqr_wide, #elts: 10e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.0     0.000      909090.9
   64    32       0.1     0.000      909090.9
  128    32       0.2     0.000      833333.3
  256    32       0.3     0.000      714285.7
  512    32       0.6     0.000      454545.5
 1024    32       1.2     0.000      217391.3

   64    64       0.1     0.000      909090.9
  128    64       0.2     0.000      833333.3
  256    64       0.3     0.000      833333.3
  512    64       0.6     0.000      625000.0
 1024    64       1.2     0.000      322580.6
 2048    64       2.4     0.000      131578.9

Function: modexp redc, #elts: 0e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)

This has not completed after several minutes. While it runs, the process uses 100% CPU, so there is probably an infinite loop somewhere.

unzvfu commented 5 years ago

Thanks for reporting the issue. The latest version compiles and runs fine for me. Are you running cuda-fixnum 0.2.1? Could you please post your build setup (e.g. OS, nvcc --version, nvidia-smi etc.)?

unzvfu commented 5 years ago

Right, I have reproduced this on CC 7+ (Volta), so it is probably related to the independent thread scheduling introduced in Volta. I can't see anything else in the changes from CC 6 to CC 7 that might cause a problem. Oddly enough, the function in which execution stalls is __ballot_sync (the replacement for the legacy __ballot, which was removed in CC 7), but only in some specific cases; other cases are unaffected.
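For readers unfamiliar with this failure mode, here is a minimal sketch of how a warp vote can go wrong under independent thread scheduling. It is not the cuda-fixnum code or its eventual fix; the kernel `vote_in_branch`, the helper `run_vote_demo`, and the threshold are hypothetical names chosen for illustration. The idea is that the mask passed to `__ballot_sync` must name only lanes that actually reach the call, so the mask is computed before the branch while every lane still participates.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: odd lanes take a branch and vote inside it.
__global__ void vote_in_branch(const int *vals, unsigned *out, int threshold)
{
    int lane = threadIdx.x % warpSize;
    bool takes_branch = (lane % 2) == 1;

    // Every lane executes this call unconditionally, so the full mask is valid.
    unsigned branch_mask = __ballot_sync(0xFFFFFFFFu, takes_branch);

    if (takes_branch) {
        // Only the lanes named in branch_mask reach this call. Passing
        // 0xFFFFFFFF here instead would name lanes that never arrive;
        // the result is then undefined, and on CC 7+ the warp may hang.
        unsigned votes = __ballot_sync(branch_mask, vals[lane] > threshold);
        if (lane == 1) *out = votes;
    }
}

// Runs a single warp and returns the vote mask (0 on any CUDA error).
unsigned run_vote_demo()
{
    int h_vals[32];
    for (int i = 0; i < 32; ++i) h_vals[i] = i;

    int *d_vals = nullptr;
    unsigned *d_out = nullptr;
    unsigned h_out = 0;
    if (cudaMalloc(&d_vals, sizeof(h_vals)) != cudaSuccess) return 0;
    if (cudaMalloc(&d_out, sizeof(unsigned)) != cudaSuccess) {
        cudaFree(d_vals);
        return 0;
    }
    cudaMemcpy(d_vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(unsigned));

    vote_in_branch<<<1, 32>>>(d_vals, d_out, 15);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : 0;
}

int main()
{
    // Odd lanes 17..31 hold values above 15, so bits 17, 19, ..., 31 are set.
    printf("votes = 0x%08X\n", run_vote_demo());  // votes = 0xAAAA0000
    return 0;
}
```

Building with something like `nvcc -arch=sm_70 demo.cu` (the file name is arbitrary) targets Volta. `__activemask()` is sometimes used to build the mask inside the branch instead, but it only reports which lanes happen to be converged at that instant, so deriving the mask from the branch condition is usually the safer choice.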

unzvfu commented 5 years ago

I believe this is finally fixed in 759bac3f10366d9674b723c755dcd774797e428b (including the previous two commits 1f1a432a40e6e93d199ccc35480c9fe971ac6238 and d6cf0093f510d620a17ba06b2909432777c201a6).

unzvfu commented 5 years ago

@jkrauska is still experiencing this issue so re-opening.

jkrauska commented 5 years ago

My mistake -- I was operating on my own fork and I had not yet included your changes.

It's all working.

$ bin/bench 5000000
Function: mul_lo, #elts: 5000e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32      19.1     0.000    56818181.8
   64    32      38.1     0.000    28571428.6
  128    32      76.3     0.000    13333333.3
  256    32     152.6     0.001     5434782.6
  512    32     305.2     0.003     1679543.2
 1024    32     610.4     0.008      630278.6

   64    64      38.1     0.000    30674846.6
  128    64      76.3     0.000    15974440.9
  256    64     152.6     0.001     8143322.5
  512    64     305.2     0.001     4038772.2
 1024    64     610.4     0.004     1146000.5
 2048    64    1220.7     0.017      302078.3

Function: mul_wide, #elts: 5000e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32      19.1     0.000    57471264.4
   64    32      38.1     0.000    30487804.9
  128    32      76.3     0.000    15822784.8
  256    32     152.6     0.001     7704160.2
  512    32     305.2     0.002     3021148.0
 1024    32     610.4     0.007      727167.0

   64    64      38.1     0.000    30674846.6
  128    64      76.3     0.000    15923566.9
  256    64     152.6     0.001     8169934.6
  512    64     305.2     0.001     4048583.0
 1024    64     610.4     0.004     1212121.2
 2048    64    1220.7     0.016      311003.3

Function: sqr_wide, #elts: 5000e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32      19.1     0.000    57471264.4
   64    32      38.1     0.000    30303030.3
  128    32      76.3     0.000    15873015.9
  256    32     152.6     0.001     7704160.2
  512    32     305.2     0.003     1708233.7
 1024    32     610.4     0.011      471742.6

   64    64      38.1     0.000    30674846.6
  128    64      76.3     0.000    15923566.9
  256    64     152.6     0.001     8169934.6
  512    64     305.2     0.001     4029008.9
 1024    64     610.4     0.006      897182.8
 2048    64    1220.7     0.021      236708.8

Function: modexp redc, #elts: 50e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.2     0.006        8436.0
   64    32       0.4     0.011        4478.7
  128    32       0.8     0.014        3525.1
  256    32       1.5     0.037        1342.1
  512    32       3.1     0.132         378.1
 1024    32       6.1     0.368         135.9

   64    64       0.4     0.011        4416.6
  128    64       0.8     0.021        2342.4
  256    64       1.5     0.040        1238.5
  512    64       3.1     0.089         564.0
 1024    64       6.1     0.417         119.8
 2048    64      12.2     1.480          33.8

Function: modexp cios, #elts: 50e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.2     0.006        7770.0
   64    32       0.4     0.009        5423.6
  128    32       0.8     0.015        3395.6
  256    32       1.5     0.019        2701.5
  512    32       3.1     0.044        1138.7
 1024    32       6.1     0.154         323.7

   64    64       0.4     0.010        5194.8
  128    64       0.8     0.014        3627.7
  256    64       1.5     0.023        2158.1
  512    64       3.1     0.043        1174.7
 1024    64       6.1     0.171         293.0
 2048    64      12.2     0.863          58.0

Function: multi modexp redc, #elts: 50e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.2     0.007        7531.3
   64    32       0.4     0.015        3331.3
  128    32       0.8     0.016        3192.4
  256    32       1.5     0.040        1255.6
  512    32       3.1     0.129         387.7
 1024    32       6.1     0.373         134.1

   64    64       0.4     0.011        4494.0
  128    64       0.8     0.021        2359.0
  256    64       1.5     0.041        1223.1
  512    64       3.1     0.086         580.9
 1024    64       6.1     0.415         120.5
 2048    64      12.2     1.490          33.6

Function: multi modexp cios, #elts: 50e3
fixnum digit  total data   time       Kops/s
 bits  bits     (MiB)    (seconds)
   32    32       0.2     0.006        8302.9
   64    32       0.4     0.010        5058.7
  128    32       0.8     0.016        3069.0
  256    32       1.5     0.019        2650.8
  512    32       3.1     0.041        1209.9
 1024    32       6.1     0.155         322.4

   64    64       0.4     0.009        5301.7
  128    64       0.8     0.013        3804.9
  256    64       1.5     0.021        2326.0
  512    64       3.1     0.043        1175.9
 1024    64       6.1     0.171         292.0
 2048    64      12.2     0.870          57.5