⚠️ I don't plan to merge this PR, but am requesting review for the sake of knowledge transfer. ⚠️
This PR tests out the use of taichi for the comps algorithm rather than numba.
Connects #229.
Findings
See Benchmarking below for detailed stats comparing this approach to our existing code on different architectures. Some high-level takeaways:
CUDA doesn't seem to make much of a difference, and is counterproductive if anything. This makes me wonder whether the algorithm needs to be redesigned to make better use of the GPU (note that numba has an entirely separate interface for CUDA programming), but I'm considering that question out of scope for now.
There are big performance gains to be had by simply bumping the instance type with the existing numba code. If the numbers below hold, we could speed up the comps code by 2x by switching to c5.24xlarge instances, which are about twice as expensive as the m4.10xlarge instances we use now (meaning the change should roughly break even on cost).
At small scales (20k observations/10k comparisons), taichi appears to outperform numba, but this improvement disappears if we scale up the size of the data. At a large scale (100k observations/50k comparisons), they perform about the same.
As evidenced by the code in this PR, the taichi interface is harder to work with than numba's. Taichi is stricter about types and doesn't support some basic operations that Python does (most notably, getting the shape of an array and returning an array from a function), which makes the code more confusing and un-Pythonic.
Benchmarking
20k observations, 10k comparisons
100k observations, 50k comparisons