Single precision can be slower than double on GPU

This code has quite peculiar behavior on the OpenCL backend:

import bohrium as np
import time

DTYPE = np.float32

if __name__ == "__main__":
    a = np.random.rand(100,10**6, dtype=DTYPE)
    np.flush()

    start = time.time()
    for _ in xrange(100):
        for i in xrange(len(a)):
            a[0,0] = np.sum(a[i])
        np.flush()
    end = time.time()
    print(end-start)

First of all, for float32, the execution time is about twice as high as for float64. On top of that, it seems like everything is copied to and from the CPU all the time, and I don't see any reason for that. Running with BH_OPENCL_VERBOSE=1 floods stdout with these messages:

Offloading to CPU
Copy to host: a1162{dtype: BH_FLOAT64, nelem: 40000, address: 0x5603dfc80f40}
Copy to device: a303{dtype: BH_FLOAT64, nelem: 1, address: 0x5603df798f20}

Profiler output for float32:

[OpenCL] Profiling: 
Fuse cache hits:                 100/103 (97.0874%)
Kernel cache hits                20000/20001 (99.995%)
Array contractions:              2/60003 (0.00333317%)
Outer-fusion ratio:              301/505 (59.604%)

Max memory usage:                381.47 MB
Syncs to NumPy:                  0
Total Work:                      10600000000 operations
Throughput:                      8.34016e+08ops
Work below par-threshold (1000): 0.000188679%

Wall clock:                      12.7096s
Total Execution:                 10.4416s
  Pre-fusion:                    0.000831275s
  Fusion:                        0.0252068s
  Codegen:                       1.28743s
  Compile:                       0.0912269s
  Exec:                          2.34692s
  Copy2dev:                      1.87636s
  Copy2host:                     1.62509s
  Ext-method:                    0s
  Offload:                       3.77476s
  Other:                         -0.586248s

Unaccounted for (wall - total):  2.26801s

and for float64:

[OpenCL] Profiling: 
Fuse cache hits:                 100/103 (97.0874%)
Kernel cache hits                20000/20001 (99.995%)
Array contractions:              2/60003 (0.00333317%)
Outer-fusion ratio:              301/505 (59.604%)

Max memory usage:                762.939 MB
Syncs to NumPy:                  0
Total Work:                      10600000000 operations
Throughput:                      9.53001e+08ops
Work below par-threshold (1000): 0.000188679%

Wall clock:                      11.1228s
Total Execution:                 8.90308s
  Pre-fusion:                    0.000827866s
  Fusion:                        0.024265s
  Codegen:                       1.3059s
  Compile:                       0.093176s
  Exec:                          1.18055s
  Copy2dev:                      1.85703s
  Copy2host:                     1.63883s
  Ext-method:                    0s
  Offload:                       3.41141s
  Other:                         -0.608905s

Unaccounted for (wall - total):  2.21967s

bh107 / bohrium

Single precision can be slower than double on GPU #439