v0.6 -- InCoreFalkon, CUDA LAUUM, Bug fixes

Large pull request to incorporate several changes.

The driving change is an in-core version of Falkon, implemented as the InCoreFalkon class and suited to smaller datasets. The data is kept on the GPU throughout, so the model trains much faster.

LAUUM now uses a CUDA implementation for its inner-loop function.
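
For reviewers unfamiliar with the operation: in LAPACK, LAUUM computes the product of a triangular matrix with its own transpose (U U^T for an upper factor, L^T L for a lower one). The pure-Python sketch below only illustrates the lower-triangular case as a reference for what the result should look like. It is not the CUDA code from this pull request, and the name `lauum_lower` is hypothetical.

```python
def lauum_lower(L):
    """Return L^T L for a lower-triangular matrix L (a list of rows).

    Reference sketch only: LAPACK's LAUUM overwrites one triangle of the
    input in place, whereas this version returns the full symmetric
    product in a new list of lists.
    """
    n = len(L)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # (L^T L)[i][j] = sum_k L[k][i] * L[k][j].
            # L is lower triangular, so L[k][i] is zero for k < i,
            # and since j <= i the sum can start at k = i.
            s = sum(L[k][i] * L[k][j] for k in range(i, n))
            out[i][j] = s
            out[j][i] = s  # mirror into the upper triangle
    return out


if __name__ == "__main__":
    L = [[1.0, 0.0],
         [2.0, 3.0]]
    print(lauum_lower(L))  # [[5.0, 6.0], [6.0, 9.0]]
```

A small reference like this can be handy when checking a GPU implementation against known values on tiny inputs.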