This code has quite peculiar behavior on the OpenCL backend:
    import bohrium as np
    import time

    DTYPE = np.float32

    if __name__ == "__main__":
        a = np.random.rand(100, 10**6, dtype=DTYPE)
        np.flush()
        start = time.time()
        for _ in xrange(100):
            for i in xrange(len(a)):
                a[0, 0] = np.sum(a[i])
            np.flush()
        end = time.time()
        print(end - start)
First of all, for float32 the execution time is about twice that for float64. On top of that, everything seems to be copied back and forth between the CPU and the device all the time, and I don't see any reason for that. Running with BH_OPENCL_VERBOSE=1 floods stdout with messages like these:
Offloading to CPU
Copy to host: a1162{dtype: BH_FLOAT64, nelem: 40000, address: 0x5603dfc80f40}
Copy to device: a303{dtype: BH_FLOAT64, nelem: 1, address: 0x5603df798f20}
This is because the OpenCL and CUDA backends cannot currently reduce to a scalar, so each np.sum(a[i]) is offloaded to the CPU, which forces the copies you see in the log. We have a student working on efficient GPU reductions, which should fix this issue.
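In the meantime, one way to sidestep the scalar reduction is to reduce along an axis, so the result is an array rather than a single element. The sketch below uses plain NumPy as a stand-in (Bohrium mirrors the NumPy array API, but `bohrium` itself is assumed unavailable here); it is an illustration, not a tested Bohrium benchmark:

```python
import numpy as np  # stand-in for `import bohrium as np`

a = np.arange(12, dtype=np.float32).reshape(3, 4)

# A single axis reduction over the whole matrix: the result is an
# array (one sum per row), not a scalar, so the backend never has
# to collapse down to a single element per call.
row_sums = a.sum(axis=1)

print(row_sums)
```

In the original loop this would replace the 100 per-row `np.sum(a[i])` calls with one `a.sum(axis=1)`, which should also remove the per-iteration host/device copies.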