Cuda Calibrate Tests Passing

This changeset makes the CUDA LEAP code produce results identical to the single-threaded implementation to about 6 decimal places.
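As a rough illustration of what agreement "to about 6 decimal places" could mean in a test, the sketch below compares two result buffers element by element against an absolute tolerance of 1e-6. The helper name `approx_equal`, the use of `std::complex<double>`, and the choice of an absolute (rather than relative) tolerance are assumptions made for this example, not the checks used in the repository's tests.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of leap-accelerate): returns true when every
// element of `actual` lies within `tolerance` of the matching element of
// `expected`, measured as the magnitude of the complex difference.
bool approx_equal(const std::vector<std::complex<double>>& expected,
                  const std::vector<std::complex<double>>& actual,
                  double tolerance = 1e-6)
{
    if (expected.size() != actual.size())
    {
        return false; // differently sized results can never match
    }
    for (std::size_t i = 0; i < expected.size(); ++i)
    {
        if (std::abs(expected[i] - actual[i]) > tolerance)
        {
            return false;
        }
    }
    return true;
}
```

A CPU result and a GPU result copied back to the host could then be compared with `approx_equal(cpu_result, gpu_result)`.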

One of the optimizations made here is to load the visibilities and uvws once into a single integration object, so that only one CUDA call is needed per direction. This may not be possible with too many baselines and channels, as the GPU would run out of memory. However, there is a task to report the compute and memory footprint for the beta release, after which the batching of baselines can be restored.
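To make the memory concern concrete, here is a minimal sketch of how the footprint of a single integration could be estimated and turned into a baseline batch count. Everything in it is an assumption for illustration: the names `IntegrationShape`, `integration_bytes`, and `baseline_batches` are hypothetical, and the layout (one `std::complex<double>` visibility per baseline, channel, and polarization, plus three `double` uvw components per baseline) may differ from the actual data structures.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical description of one integration (not a leap-accelerate type).
struct IntegrationShape
{
    std::size_t baselines;
    std::size_t channels;
    std::size_t polarizations;
};

// Bytes needed for one baseline: its visibilities plus one uvw triple.
// Assumes double-precision complex visibilities and double-precision uvws.
std::size_t bytes_per_baseline(const IntegrationShape& s)
{
    return s.channels * s.polarizations * sizeof(std::complex<double>)
         + 3 * sizeof(double);
}

// Bytes needed to hold the whole integration on the device at once.
std::size_t integration_bytes(const IntegrationShape& s)
{
    return s.baselines * bytes_per_baseline(s);
}

// Number of baseline batches needed to stay within `budget_bytes`.
// Returns 0 when not even a single baseline fits in the budget.
std::size_t baseline_batches(const IntegrationShape& s, std::size_t budget_bytes)
{
    const std::size_t per_baseline = bytes_per_baseline(s);
    if (per_baseline > budget_bytes)
    {
        return 0;
    }
    const std::size_t max_baselines = budget_bytes / per_baseline;
    // Round up so a partial final batch is still counted.
    return (s.baselines + max_baselines - 1) / max_baselines;
}
```

A result of 1 from `baseline_batches` corresponds to the single-integration approach described above; a larger value indicates that batching of baselines would be needed for that memory budget.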

This also fixes a bug where only the first direction in the casa implementation was being calculated correctly.

ICRAR / leap-accelerate

Cuda Calibrate Tests Passing #42