Add namespacing and CUDA streams

Copied from commit messages:

Add namespacing to functions & constants

All of the current functions are in the top-level namespace, which opens the door to naming collisions with other code if / when this code gets used as part of a larger project somewhere else. Adding namespacing makes such collisions substantially less likely.

I've added two namespaces: a top-level flashgs namespace (or a FLASHGS_ prefix) in both the .h and .cu / .cpp files, and an additional anonymous namespace in the .cu / .cpp files so that the names of utility functions don't collide within flashgs. The latter is currently unnecessary, but it is generally good practice in C++.
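
For reviewers less familiar with the pattern, here is a minimal sketch of the layout described above. It is not code from this PR: `numBlocks`, `ceilDiv`, and `FLASHGS_EXAMPLE_BLOCK_SIZE` are hypothetical names chosen for illustration, and I'm assuming the `FLASHGS_` prefix is meant for things like macros, which namespaces cannot scope.

```cpp
#include <cassert>
#include <cstddef>

// Macros ignore namespaces, so a prefix is the only way to scope them.
// Hypothetical constant, not taken from the FlashGS sources.
#define FLASHGS_EXAMPLE_BLOCK_SIZE 256

namespace flashgs {

// Public API, as it would be declared in the .h file.
std::size_t numBlocks(std::size_t numElements);

namespace {
// Utility helper in an anonymous namespace: it has internal linkage, so
// another .cu / .cpp file may define its own ceilDiv without a clash.
std::size_t ceilDiv(std::size_t a, std::size_t b) {
    assert(b != 0);
    return (a + b - 1) / b;
}
}  // namespace

// Definition, as it would appear in the .cu / .cpp file.
std::size_t numBlocks(std::size_t numElements) {
    return ceilDiv(numElements, FLASHGS_EXAMPLE_BLOCK_SIZE);
}

}  // namespace flashgs

// Unrelated code in the global namespace can reuse the same name
// without colliding with flashgs::numBlocks.
std::size_t numBlocks(std::size_t) { return 0; }
```

Callers then write `flashgs::numBlocks(n)`, and a host project that already has its own `numBlocks` keeps compiling and linking as before.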

Add support for CUDA streams

It's good practice to let callers choose which stream CUDA kernels run on. I've exposed this through the C++ layer, so a stream can now be passed at the C++ interfaces. The default is still stream 0 (the default CUDA stream, which is what was used before). The Python layer also still uses stream 0 and does not expose an API for selecting a different stream.
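
A rough sketch of the interface shape this implies, again not the actual FlashGS signatures. To keep it compilable without the CUDA toolkit, `stream_t` is a stand-in for `cudaStream_t` (which is likewise an opaque pointer whose null value, stream 0, denotes the default stream), and `render`, `launchLog`, and the `flashgs_example` namespace are hypothetical. In the real code the body would launch with `kernel<<<grid, block, 0, stream>>>(...)` instead of recording the handle.

```cpp
#include <cassert>
#include <vector>

namespace flashgs_example {  // hypothetical namespace for this sketch

// Stand-in for cudaStream_t so the sketch builds without CUDA headers.
struct StreamTag {};
using stream_t = StreamTag*;

// Records the stream of each simulated launch. Real code would pass the
// stream as the fourth launch parameter of the kernel instead.
std::vector<stream_t>& launchLog() {
    static std::vector<stream_t> log;
    return log;
}

// The stream is the last parameter and defaults to the null handle
// (stream 0), so existing call sites keep their previous behavior.
void render(int numPoints, stream_t stream = nullptr) {
    assert(numPoints >= 0);
    launchLog().push_back(stream);
}

}  // namespace flashgs_example
```

With this shape, `render(n)` behaves as before, while `render(n, myStream)` lets a caller overlap the work with other streams. Putting the stream last with a default value is what keeps the change backward compatible.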

InternLandMark / FlashGS #8