Closed marty1885 closed 4 years ago
Hand-wavy analysis
I have come up with a reasonable parallelization strategy. For CPU: if there are more than num_threads/2
results to be generated, we assign 1 thread per result. Otherwise we assign num_threads/num_result threads to each result.
For GPU: if num_result > num_compute_unit, we assign 1 processing element per result. Otherwise we assign a compute unit per result.
Since we can't access the number of threads on TBB, I made it so that if we are trying to generate only 1 result, all the cores are used for it.
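The heuristic described above can be sketched roughly as follows. This is only an illustration of the decision rule, not the project's actual code: the function names `threadsPerResult` and `gpuGranularity`, the `GpuGranularity` enum, and the parameter names are all made up for this example.

```cpp
#include <cstddef>

// Hypothetical helper: how many CPU threads each result gets.
// More than num_threads/2 results -> 1 thread per result,
// otherwise split the threads evenly between the results.
std::size_t threadsPerResult(std::size_t num_results, std::size_t num_threads)
{
    if (num_results == 0 || num_threads == 0)
        return 0; // nothing to compute, or nothing to compute with
    if (num_results > num_threads / 2)
        return 1;
    return num_threads / num_results;
}

// Hypothetical enum naming the two GPU work assignments from the text.
enum class GpuGranularity {
    ProcessingElementPerResult, // many results: one work-item each
    ComputeUnitPerResult        // few results: one work-group each
};

// Hypothetical helper: pick the GPU work assignment.
GpuGranularity gpuGranularity(std::size_t num_results,
                              std::size_t num_compute_units)
{
    return num_results > num_compute_units
               ? GpuGranularity::ProcessingElementPerResult
               : GpuGranularity::ComputeUnitPerResult;
}
```

With 8 threads, for example, 5 results get 1 thread each, 4 results get 2 threads each, and a single result gets all 8.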
The current sum() implementation in both the CPU and OpenCL backends parallelizes based on how many results it needs to generate. So it only uses 1 thread when we are summing the entire tensor into a single number, which is slow and inefficient.
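To make the single-result case concrete, here is a minimal sketch of a reduction that splits one sum across several threads instead of giving the whole result to one thread. It uses plain `std::thread` rather than TBB or OpenCL, and `chunkedSum` is a hypothetical name, so treat it as an illustration of the idea rather than what either backend does.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum `data` into one number using up to
// `num_threads` threads. Each thread sums one contiguous chunk,
// then the partial sums are combined on the calling thread.
double chunkedSum(const std::vector<double>& data, std::size_t num_threads)
{
    // Never use more threads than elements, and always at least one.
    num_threads = std::max<std::size_t>(1, std::min(num_threads, data.size()));
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            // Each thread writes only its own slot, so no locking is needed.
            partial[t] = std::accumulate(first, last, 0.0);
        });
    }
    for (auto& w : workers)
        w.join();

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Note that the summation order differs from a sequential loop, so for floating-point data the result may differ slightly from a single-threaded sum.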