Juzhen is a set of C++ APIs for matrix operations. It provides a higher-level interface over low-level numerical libraries such as CBLAS and CUDA, and it offers a neural-net API similar to the ones used in PyTorch or TensorFlow.
The library is developed under C++17 and CUDA 11.8.
You can perform matrix operations like this:
#include <iostream>
#include "cpp/juzhen.hpp"
using namespace std;
int compute() {
    // declare matrices like you would in MATLAB or NumPy
    Matrix<float> A = {"A", {{1,2,3},{4,5,6}}};
    cout << A << endl;
    Matrix<float> B = {"B", {{.1,.2},{.3,.4},{.5,.6}}};
    cout << B << endl << endl;

    // compute with matrices as if you were using MATLAB
    cout << log(exp(A*B)+1.0f)/5.0f << endl;
    return 0;
}
Then compile:
mkdir bin
clang++ -x c++ -std=c++20 -DCPU_ONLY -DLOGGING_OFF -O3 ./cpp/launcher.cu ./helloworld.cu -o bin/helloworld.out -lopenblas
bin/helloworld.out
or on GPU:
#include <iostream>
#include "cpp/juzhen.hpp"
using namespace std;
int compute() {
    // matrices on the GPU
    Matrix<CUDAfloat> A(Matrix<float>("A", {{1,2,3},{4,5,6}}));
    cout << A << endl;
    Matrix<CUDAfloat> B(Matrix<float>("B", {{.1,.2},{.3,.4},{.5,.6}}));
    cout << B << endl << endl;

    // matrices are downloaded from the GPU only when they are printed
    cout << (log(exp(A*B)+1.0f)/5.0f) << endl;
    return 0;
}
Then compile:
mkdir bin
nvcc -std=c++20 -DLOGGING_OFF -O3 ./cpp/launcher.cu ./cpp/cumatrix.cu ./cpp/cukernels.cu ./helloworld.cu -o bin/helloworld.out -lcublas -lcurand
bin/helloworld.out
Both programs print:
A 2 by 3
1 2 3
4 5 6
B 3 by 2
0.1 0.2
0.3 0.4
0.5 0.6
logM 2 by 2
0.461017 0.571807
0.981484 1.28033
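As a quick sanity check, the top-left entry of A*B is 1·0.1 + 2·0.3 + 3·0.5 = 2.2, and log(exp(2.2) + 1)/5 ≈ 0.461017, which matches the output above.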
You can verify the result using MATLAB:
>> A = [1,2,3;4,5,6];
>> B = [.1,.2;.3,.4;.5,.6];
>> (log(exp(A*B)+1.0)./5.0)
ans =
0.4610 0.5718
0.9815 1.2803
>>
Juzhen also supports matrix slicing:
#include <iostream>
using namespace std;
#include "cpp/juzhen.hpp"
int compute() {
    Matrix<float> A = { "A", {{1,2,3},{4,5,6}} };
    cout << A.columns(0, 2).inv() << endl;
    return 0;
}
The code above is the same as the following MATLAB code:
>> A = [1,2,3;4,5,6];
>> inv(A(:,1:2))
ans =
-1.6667 0.6667
1.3333 -0.3333
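Slices compose with the other matrix operations. Here is a minimal sketch using only the calls shown above; the squared-norm computation is just for illustration:
#include <iostream>
using namespace std;
#include "cpp/juzhen.hpp"

int compute() {
    Matrix<float> A = { "A", {{1,2,3},{4,5,6}} };
    auto c = A.columns(2, 3);  // the last column: [3; 6]
    cout << c.T() * c << endl; // 1-by-1 squared norm: 3*3 + 6*6 = 45
    return 0;
}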
You can also use the higher-level neural-net API to write neural-network applications:
//helloworldnn.cu
#include "ml/layer.hpp"
using namespace std;
using namespace Juzhen;
#define FLOAT float
inline Matrix<float> randn(int m, int n) { return Matrix<float>::randn(m, n); }
inline Matrix<float> ones(int m, int n) { return Matrix<float>::ones(m, n); }
#define MatrixI Matrix<int>

int compute()
{
    // problem setup
    const int n = 5000, batchsize = 50, numbatches = n / batchsize;

    // regression dataset generation
    auto X = randn(10, n), beta = randn(10, 1);
    auto Y = beta.T() * X + randn(1, n);
    auto XT = randn(10, n);
    auto YT = beta.T() * XT + randn(1, n);

    // define layers
    Layer<float> L0(16, 10, batchsize), L1(4, 16, batchsize);
    LinearLayer<float> L2(1, 4, batchsize);
    // least-squares loss
    LossLayer<float> L3t(n, YT);

    // nns are linked lists containing layers
    list<Layer<float> *> trainnn({&L2, &L1, &L0}), testnn({&L3t, &L2, &L1, &L0});

    // sgd
    int iter = 0;
    while (iter < 10000)
    {
        int batch_id = (iter % numbatches);

        // obtain batches
        auto X_i = X.columns(batchsize * batch_id, batchsize * (batch_id + 1));
        auto Y_i = Y.columns(batchsize * batch_id, batchsize * (batch_id + 1));

        // forward-backward pass
        forward(trainnn, X_i);
        LossLayer<float> L3(batchsize, Y_i);
        trainnn.push_front(&L3);
        backprop(trainnn, X_i);
        trainnn.pop_front();

        // print progress
        if (iter % 1000 == 0)
        {
            cout << "testing loss: " << Juzhen::forward(testnn, XT).elem(0, 0) << endl;
        }
        iter++;
    }
    return 0;
}
Then compile:
mkdir bin
clang++ -x c++ -std=c++20 -DCPU_ONLY -DLOGGING_OFF -O3 ./cpp/launcher.cu ./helloworldnn.cu -o bin/helloworldnn.out -lopenblas
bin/helloworldnn.out
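Once trained, the same forward helper can be used for prediction on a fresh batch. The following is a minimal sketch under the assumption that forward on a network without a loss layer returns the network output; Xnew and Yhat are hypothetical names, appended at the end of compute() above:
// hypothetical prediction step, appended to compute() above
auto Xnew = randn(10, batchsize);   // a fresh batch of inputs (must have batchsize columns)
auto Yhat = forward(trainnn, Xnew); // assumed to return the 1-by-batchsize output
cout << Yhat << endl;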
NO(*). MATLAB and NumPy are popular among data scientists for a reason; if you are interested in numerical C++ coding, I recommend Eigen or Armadillo.
(*) This software is written with educational purposes in mind. If you still want to use it, please read on.
First, install CBLAS and LAPACK and set the environment variables.
On Ubuntu:
sudo apt install libopenblas-dev liblapack-dev
On Windows, you can download precompiled binaries from OpenBLAS.
On macOS, BLAS and LAPACK are part of the Accelerate framework, which comes with Xcode.
To use Juzhen without GPU support, copy the "cpp" folder into your C++ project directory and include the header file juzhen.hpp.
Suppose you wrote your code in main.cu. Compile it using:
clang++ -x c++ -I cpp/ -DLOGGING_OFF -D CPU_ONLY -O3 cpp/launcher.cu main.cu -o bin/main.out -llapack -lopenblas
Do not forget to link LAPACK and BLAS!
If you want to use the GPU APIs, you need to install CUDA and compile your code using nvcc:
nvcc -I cpp/ -O3 -DLOGGING_OFF main.cu cpp/cumatrix.cu cpp/cukernels.cu cpp/launcher.cu -o bin/main.out -lcublas -lcurand -llapack -lopenblas --extended-lambda
Note that we dropped the -D CPU_ONLY flag.
# to build all examples using CMake
mkdir build
cd build && cmake ..
cmake --build . --config=Release
# to run
./helloworld
std::move Semantics
Consider the following examples:
Matrix<float> A = {"A",{{1,2},{3,4},{5,6}}}; // A is created; memory allocated.
auto B = A;   // B is a copy of A; extra space is allocated for B.
B.zeros();    // B is zero, but A remains the same.

Matrix<float> A = {"A",{{1,2},{3,4},{5,6}}};
auto B = std::move(A); // transfer the memory owned by A to B.

Matrix<float> A = {"A",{{1,2},{3,4},{5,6}}};
auto B = exp(A); // new memory allocated for B.
B.zeros();       // B is zero, but A is not affected.

Matrix<float> A = {"A",{{1,2},{3,4},{5,6}}};
auto B = exp(std::move(A)); // reuse the memory owned by A to create B; A no longer owns any memory.
Allocating memory is expensive, so make good use of std::move semantics and steal memory from sacrificial intermediate results.
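For instance, an iterative update can recycle its buffer by moving the previous iterate into the function that produces the next one. A minimal sketch, using only the operations shown above:
Matrix<float> X = Matrix<float>::randn(1000, 1000);
for (int i = 0; i < 100; i++) {
    // exp() consumes X's buffer instead of allocating a fresh one
    X = exp(std::move(X)) / 2.0f;
}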
The allocated memory will not be immediately released, so later computations can reclaim those spaces without calling memory allocation functions. To release the memory before the scope exits, please remember to add a line at the beginning of the scope:
MemoryDeleter<T> md1;
where T is the type of your matrix.
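For example, a minimal sketch, assuming float matrices whose cached buffers should be freed when compute() returns:
int compute() {
    MemoryDeleter<float> md1; // releases the cached float-matrix buffers when this scope exits
    Matrix<float> A = Matrix<float>::randn(500, 500);
    auto B = exp(A) * A.T();  // intermediate buffers go back to the cache for reuse
    return 0;
} // md1 frees the cached memory here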
main.cu
#include <iostream>
using namespace std;
#include "cpp/juzhen.hpp"
int compute() {
    Matrix<float> A = Matrix<float>::randn(500, 1000);
    {
        Profiler p; // start the profiler
        for (int i = 0; i < 1000; i++)
        {
            auto &&C = A * A.T();
        }
    } // the profiler automatically stops and prints the elapsed time when the scope exits
    return 0;
}
Compile and Run:
$ clang++ -x c++ -I cpp/ -DLOGGING_OFF -D CPU_ONLY -O3 cpp/launcher.cu main.cu -o bin/main.out -llapack -lopenblas
$ bin/main.out
Time: 1247.48 ms
Total memory released: 2.86102 MB.
Compare it with MATLAB:
>> A = randn(500,1000,'single');
tic;
for i=1:1000
C = A*A';
end;
toc;
Elapsed time is 0.759065 seconds.
Benchmark using the MNIST example; times were collected by the built-in profiling tool.