Enhancement for transform operations

Added more information to the error message shown when reading a matrix from a file fails.
Reorganized the functions that compute rowBlocks: they now live in csr-meta.hpp / csr-meta.cpp instead of clsparse-coo2csr.cpp. The remaining functions in clsparse-coo2csr were unused, so clsparse-coo2csr-GPU was renamed to a proper name.
Csr_matrix_environment allocates the matrix in double precision first, then casts it to single precision.
Added sanity checks to clsparseD/SCsrMatrixFromFile.
Added inclusive and exclusive scan operations, with tests.
Rewrote the reduce-by-key operation, with tests.
Rewrote coo2csr and csr2coo so they no longer require a radix sort; they now use 10 kernel calls instead of 34, and simple tests showed a 6x speedup.
Rewrote dense2csr and csr2dense more cleanly, improving performance by eliminating unnecessary copies.
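For reviewers, the scan and reduce-by-key semantics can be pinned down with a small sequential host-side reference. This is a sketch for checking results, not the OpenCL kernels in this PR; the function names are hypothetical, and reduce-by-key is assumed to sum the values of each run of consecutive equal keys (Thrust-style semantics).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive scan (hypothetical reference): out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Exclusive scan (hypothetical reference): out[0] = init,
// out[i] = init + in[0] + ... + in[i-1].
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init = 0) {
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Reduce by key (assumed semantics): sums values over each run of
// consecutive equal keys, emitting one (key, sum) pair per run.
void reduce_by_key_ref(const std::vector<int>& keys, const std::vector<int>& values,
                       std::vector<int>& keys_out, std::vector<int>& values_out) {
    assert(keys.size() == values.size());
    keys_out.clear();
    values_out.clear();
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || keys[i] != keys[i - 1]) {
            keys_out.push_back(keys[i]);   // start of a new run
            values_out.push_back(values[i]);
        } else {
            values_out.back() += values[i];  // same run: accumulate
        }
    }
}
```

For example, scanning {3, 1, 4, 1, 5} gives {3, 4, 8, 9, 14} (inclusive) and {0, 3, 4, 8, 9} (exclusive).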
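The PR does not spell out its 10-kernel sequence, but one standard sort-free approach to coo2csr / csr2coo looks like the host sketch below. It assumes the COO entries are already ordered by (row, col), so column indices and values can be copied unchanged and only the row representation needs converting; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// coo2csr (sketch): builds CSR row offsets from COO row indices.
std::vector<int> coo_rows_to_csr_offsets(const std::vector<int>& coo_rows, int num_rows) {
    std::vector<int> offsets(num_rows + 1, 0);
    // Histogram: count nonzeros per row, stored one slot to the right.
    for (int r : coo_rows) {
        assert(r >= 0 && r < num_rows);
        ++offsets[r + 1];
    }
    // Scanning the shifted counts yields the exclusive scan of the counts,
    // i.e. the start offset of every row plus the total nnz at the end.
    for (int i = 0; i < num_rows; ++i) {
        offsets[i + 1] += offsets[i];
    }
    return offsets;
}

// csr2coo (sketch): expands row offsets into one row index per nonzero.
std::vector<int> csr_offsets_to_coo_rows(const std::vector<int>& offsets) {
    assert(!offsets.empty());
    std::vector<int> coo_rows(offsets.back());
    for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
        for (int k = offsets[row]; k < offsets[row + 1]; ++k) {
            coo_rows[k] = static_cast<int>(row);
        }
    }
    return coo_rows;
}
```

With row indices {0, 0, 2, 3, 3, 3} and 4 rows, the offsets are {0, 2, 2, 3, 6}; an empty row (row 1) simply repeats the previous offset. Both steps map naturally onto a histogram plus a scan on the GPU, which is why no sort is needed when the input is already row-ordered.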
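The expected results of dense2csr and csr2dense can be stated with a simple host reference. This sketch assumes a row-major dense buffer and uses a hypothetical `CsrSketch` struct rather than clSPARSE's own matrix types; it defines what the conversions should produce and says nothing about how the GPU version avoids copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container for this sketch only.
struct CsrSketch {
    int num_rows = 0;
    int num_cols = 0;
    std::vector<int> row_offsets;  // size num_rows + 1
    std::vector<int> col_indices;  // size nnz
    std::vector<float> values;     // size nnz
};

// dense2csr (sketch): walks a row-major dense matrix, keeping non-zeros.
CsrSketch dense_to_csr(const std::vector<float>& dense, int num_rows, int num_cols) {
    assert(dense.size() == static_cast<std::size_t>(num_rows) * num_cols);
    CsrSketch csr;
    csr.num_rows = num_rows;
    csr.num_cols = num_cols;
    csr.row_offsets.push_back(0);
    for (int r = 0; r < num_rows; ++r) {
        for (int c = 0; c < num_cols; ++c) {
            float v = dense[static_cast<std::size_t>(r) * num_cols + c];
            if (v != 0.0f) {
                csr.col_indices.push_back(c);
                csr.values.push_back(v);
            }
        }
        // End of row r == number of non-zeros stored so far.
        csr.row_offsets.push_back(static_cast<int>(csr.values.size()));
    }
    return csr;
}

// csr2dense (sketch): zero-fills a row-major buffer, then scatters entries.
std::vector<float> csr_to_dense(const CsrSketch& csr) {
    std::vector<float> dense(static_cast<std::size_t>(csr.num_rows) * csr.num_cols, 0.0f);
    for (int r = 0; r < csr.num_rows; ++r) {
        for (int k = csr.row_offsets[r]; k < csr.row_offsets[r + 1]; ++k) {
            dense[static_cast<std::size_t>(r) * csr.num_cols + csr.col_indices[k]] = csr.values[k];
        }
    }
    return dense;
}
```

A round trip `csr_to_dense(dense_to_csr(d, rows, cols))` should reproduce `d` exactly, which makes a convenient test oracle.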

Minor:

When reading a matrix directly in COO format, the data must also be sorted by (row, col); otherwise it stays in column-major order, which is the default for .mtx storage.
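A minimal host-side sketch of that sorting step, assuming the reader produces three parallel arrays (the `CooSketch` struct and function name are hypothetical): it sorts a permutation by (row, col) and then gathers all three arrays through it, so rows, columns and values stay aligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical COO triplet arrays as an .mtx reader might produce them.
struct CooSketch {
    std::vector<int> rows;
    std::vector<int> cols;
    std::vector<float> values;
};

// Reorders entries into row-major (row, col) order via a permutation.
void sort_coo_row_major(CooSketch& coo) {
    assert(coo.rows.size() == coo.cols.size() && coo.rows.size() == coo.values.size());
    std::vector<std::size_t> perm(coo.rows.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
        if (coo.rows[a] != coo.rows[b]) return coo.rows[a] < coo.rows[b];
        return coo.cols[a] < coo.cols[b];
    });
    // Gather every array through the same permutation.
    CooSketch sorted;
    sorted.rows.reserve(perm.size());
    sorted.cols.reserve(perm.size());
    sorted.values.reserve(perm.size());
    for (std::size_t p : perm) {
        sorted.rows.push_back(coo.rows[p]);
        sorted.cols.push_back(coo.cols[p]);
        sorted.values.push_back(coo.values[p]);
    }
    coo = std::move(sorted);
}
```

For a 2x2 matrix stored column-major as (0,0), (1,0), (0,1), (1,1), the sorted order becomes (0,0), (0,1), (1,0), (1,1).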

Closes #131

clMathLibraries / clSPARSE