sum() does not parallelize well if it only produces 1 result

etaler / Etaler

A flexible HTM (Hierarchical Temporal Memory) framework with full GPU support.

BSD 3-Clause "New" or "Revised" License


sum() does not parallelize well if it only produces 1 result #118

Closed: marty1885 closed this issue 4 years ago

marty1885 commented 4 years ago

The current sum() implementation in both the CPU and OpenCL backends parallelizes based on how many results it needs to generate. So it uses only 1 thread when we sum the entire tensor into a single number. This is slow and inefficient.

marty1885 commented 4 years ago

Hand-wavy analysis

I have come up with a reasonable parallelization strategy. For CPU: If there are more than num_threads/2 results to be generated, we assign 1 thread per result. Otherwise, we assign num_threads/num_result threads to each result.
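
The CPU rule above could be sketched roughly as follows. This is only an illustration of the heuristic as stated in this comment, not Etaler's actual code: the function name `threads_per_result` is hypothetical, and it assumes the thread count is known up front.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Etaler): decide how many threads
// cooperate on each output element of sum().
std::size_t threads_per_result(std::size_t num_threads, std::size_t num_result)
{
    assert(num_threads > 0 && num_result > 0);

    // Many results: parallelize across results, 1 thread each.
    if (num_result > num_threads / 2)
        return 1;

    // Few results: split the available threads evenly between them.
    // The max() is only a guard; this branch always yields at least 2.
    return std::max<std::size_t>(1, num_threads / num_result);
}
```

With 8 threads, for example, a full reduction to 1 result would get all 8 threads, 4 results would get 2 threads each, and 5 or more results would fall back to 1 thread per result.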

For GPU: If num_result > num_compute_unit, we assign 1 processing element per result. Otherwise, we assign a compute unit per result.
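
On the host side, the GPU rule might translate into a choice of launch sizes like the sketch below. It assumes that "1 processing element per result" maps to one OpenCL work-item per result and that "a compute unit per result" maps to one work-group per result; that mapping, the `SumLaunchConfig` struct, and the `choose_sum_launch` function are assumptions made for illustration, not Etaler's API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical launch description: total work-items and work-items per group.
struct SumLaunchConfig {
    std::size_t global_size;
    std::size_t local_size;
};

// Hypothetical helper (not part of Etaler) applying the heuristic above.
SumLaunchConfig choose_sum_launch(std::size_t num_result,
                                  std::size_t num_compute_unit,
                                  std::size_t work_group_size)
{
    assert(num_result > 0 && num_compute_unit > 0 && work_group_size > 0);

    if (num_result > num_compute_unit) {
        // One work-item per result. Round the global size up to a multiple
        // of the work-group size; the kernel must then skip out-of-range ids.
        std::size_t groups = (num_result + work_group_size - 1) / work_group_size;
        return {groups * work_group_size, work_group_size};
    }

    // One work-group per result: the work-items in a group cooperate
    // (e.g. through local memory) to produce a single sum.
    return {num_result * work_group_size, work_group_size};
}
```

Under these assumptions, summing a whole tensor into 1 result with a work-group size of 64 launches a single group of 64 cooperating work-items instead of a single work-item.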

marty1885 commented 4 years ago

Since we can't access the number of threads in TBB, I simplified it: if we are generating only 1 result, use all the cores for it.
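
As a rough picture of what "use all the cores" means for the single-result case, here is a minimal sketch of a chunked parallel sum. It deliberately uses plain `std::thread` rather than TBB so that it stays self-contained, so it is not the code that went into Etaler; `parallel_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not part of Etaler): sum a whole buffer into one
// number by giving each core a contiguous chunk, then combining the
// per-thread partial sums serially.
long long parallel_sum(const std::vector<int>& data)
{
    // hardware_concurrency() may return 0 when unknown; fall back to 1.
    std::size_t num_threads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    // Never start more threads than there are elements.
    num_threads = std::min(num_threads, std::max<std::size_t>(1, data.size()));

    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    assert(chunk * num_threads >= data.size()); // every element is covered

    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Each thread writes only its own slot, so no locking is needed.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers)
        w.join();

    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

With TBB, the same shape would normally be left to the library's own reduction primitive, which chooses the chunking itself; that is why the exact thread count is not needed for this special case.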