⚠️ I don't plan to merge this PR, but am requesting review for the sake of knowledge transfer. ⚠️
This PR tests out the use of taichi for the comps algorithm rather than numba.
Connects #229.
Findings
See Benchmarking below for detailed stats comparing this approach to our existing code on different architectures. Some high-level takeaways:
CUDA doesn't seem to make much of a difference, and is counterproductive if anything. This makes me wonder whether the algorithm needs to be redesigned to make better use of the GPU (note that numba has an entirely separate interface for CUDA programming), but I'm considering that question out of scope for now.
There are big performance gains to be had by simply bumping the instance type with the existing numba code. If the numbers below hold, we could speed up the comps code by 2x by switching to c5.24xlarge instances, which are about twice as expensive as the m4.10xlarge instances we use now (meaning the change should roughly break even on cost).
At small scales (20k observations/10k comparisons), taichi appears to outperform numba, but this improvement disappears if we scale up the size of the data. At a large scale (100k observations/50k comparisons), they perform about the same.
As evidenced by the code in this PR, the taichi interface is harder to work with than numba's. Taichi is stricter about types and doesn't support some basic operations that Python does (most notably, getting the shape of an array and returning an array from a function), which makes the code more confusing and un-Pythonic.
Benchmarking
20k observations, 10k comparisons
100k observations, 50k comparisons