Accelerate DirectLiNGAM by parallelising causal ordering on GPUs with CUDA

aknvictor commented 4 months ago

This PR adds an implementation that drastically speeds up (up to 32x on a consumer GPU) DirectLiNGAM and its variants, e.g. VarLiNGAM.

It does so through an optional dependency, https://github.com/Viktour19/culingam, which implements custom CUDA kernels for the pairwise likelihood ratio causal ordering method.

The implementation has been tested locally on an NVIDIA RTX 6000 on a Linux machine, but tests on other setups are still needed.

firmai commented 4 months ago

Very interesting adaptation, looking forward to it.

ikeuchi-screen commented 4 months ago

Hi @Viktour19, thanks for your contribution!

First of all, I could not install culingam in my Windows environment with pip install culingam (although I could install it in my Linux environment).

Windows10 Pro
Python 3.9.12
CUDA Toolkit 12.2

I attempted a manual installation using the procedure shown here, but it failed.

With various modifications I was able to install. Please see below what I did and use it to improve culingam.

Environment variable CUDA_HOME

The environment variable CUDA_HOME is specified as a path in Windows. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2

Inclusion of nvToolsExt.h

Changed the header inclusion to the form shown on the official site: #include <nvtx3/nvToolsExt.h>

Compile error for M_PI

Added -D_USE_MATH_DEFINES to extra_compile_args of CUDAExtension. https://stackoverflow.com/questions/56319494/nvcc-compilation-errors-with-m-pi-and-or

nvToolsExt.lib is missing

Add the following path to library_dirs in CUDAExtension. C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64

And, change to nvToolsExt64_1.lib instead of nvToolsExt.lib. https://github.com/pytorch/pytorch/issues/101135
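The four changes above could be gathered in one place in setup.py. The following is only a sketch: the helper name `cuda_extension_kwargs` is hypothetical, the paths are the ones quoted above and may differ on other machines, and passing the define to both the host compiler and nvcc is an assumption rather than something taken from culingam's actual build script.

```python
import sys

# Path quoted above; other NVTX installs may live elsewhere.
NVTX_LIB_DIR = r"C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64"


def cuda_extension_kwargs(platform=sys.platform):
    """Return extra keyword arguments for CUDAExtension on the given platform.

    Hypothetical helper, not part of culingam.
    """
    if not platform.startswith("win"):
        # The Linux install worked unchanged, so keep the original library name.
        return {"libraries": ["nvToolsExt"]}
    defines = ["-D_USE_MATH_DEFINES"]  # exposes M_PI when building on Windows
    return {
        "library_dirs": [NVTX_LIB_DIR],
        "libraries": ["nvToolsExt64_1"],  # Windows name of the NVTX library
        # Assumption: apply the define to both host and device compilation.
        "extra_compile_args": {"cxx": defines, "nvcc": defines},
    }
```

The result would be unpacked into the extension definition, for example `CUDAExtension(name, sources, **cuda_extension_kwargs())`, with CUDA_HOME still set in the environment as described above.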

ikeuchi-screen commented 4 months ago

Hi @Viktour19, this is the result of running the existing DirectLiNGAM and the GPU version on the same data. The GPU version estimates the wrong causal order. Is it a problem with my environment?

import numpy as np
import pandas as pd
import graphviz
import lingam
from lingam.utils import make_dot
print([np.__version__, pd.__version__, graphviz.__version__, lingam.__version__])
np.random.seed(0)

['1.25.2', '2.2.0', '0.20', '1.8.3']

Test Data

x2 = np.random.uniform(size=100000)
x0 = 3.0*x2 + np.random.uniform(size=100000)
x1 = 1.0*x0 + 6.0*x2 + np.random.uniform(size=100000)
X = pd.DataFrame(np.array([x0, x1, x2]).T ,columns=['x0', 'x1', 'x2'])
make_dot([[0.0, 0.0, 3.0], [1.0, 0.0, 6.0], [0.0, 0.0, 0.0]])

output_2_0

CPU

%%time
model = lingam.DirectLiNGAM()
model = model.fit(X)

CPU times: total: 156 ms Wall time: 169 ms

print('causal ordering:', model.causal_order_)
make_dot(model.adjacency_matrix_)

causal ordering: [2, 0, 1]

output_5_1

GPU

%%time
model = lingam.DirectLiNGAM(measure='pwling_fast')
model = model.fit(X)

CPU times: total: 141 ms Wall time: 205 ms

print('causal ordering:', model.causal_order_)
make_dot(model.adjacency_matrix_)

causal ordering: [0, 1, 2]

output_8_1
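To make the comparison mechanical rather than visual, an estimated order can be checked against the true adjacency matrix. This is a small sketch in plain Python; `is_consistent_order` is a hypothetical helper and not part of lingam.

```python
def is_consistent_order(order, adjacency):
    """Return True if every cause precedes its effect in `order`.

    adjacency[i][j] != 0 means x_j -> x_i, the same convention as the
    matrix passed to make_dot for the test data.
    """
    position = {var: rank for rank, var in enumerate(order)}
    for effect, row in enumerate(adjacency):
        for cause, weight in enumerate(row):
            if weight != 0 and position[cause] > position[effect]:
                return False
    return True


true_b = [[0.0, 0.0, 3.0], [1.0, 0.0, 6.0], [0.0, 0.0, 0.0]]
print(is_consistent_order([2, 0, 1], true_b))  # CPU result -> True
print(is_consistent_order([0, 1, 2], true_b))  # GPU result -> False
```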

aknvictor commented 4 months ago

Thanks for documenting the Windows setup!

I couldn't reproduce the issue on mine. Here's the graph using the data provided: attached

Could you try running the example in DirectLiNGAM_fast.py? That includes an additional check that the compiler is available.

ikeuchi-screen commented 4 months ago

The output of the get_cuda_version function is as follows:

CUDA Version found:
 nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

The culingam installed by pip install is v0.0.7, but in the github repository it is v0.0.6. I am using v0.0.6 installed manually from github on Windows. Is this due to a different version of culingam?

ikeuchi-screen commented 4 months ago

I installed culingam v0.0.7 on Linux with pip and ran DirectLiNGAM_fast.py, but got an AssertionError on assert np.allclose(model.adjacency_matrix_, m)

ikeuchi-screen commented 4 months ago

I tried to run it with only culingam v0.0.7. I ran the following code in the Kaggle environment, but the causal order was incorrect.

Execution Result

https://github.com/Viktour19/culingam/blob/e2380d138d980196894a675691a978aa92490ee5/examples/basic.py

!pip install culingam

Collecting culingam
  Downloading culingam-0.0.7.tar.gz (27 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from culingam) (1.26.4)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from culingam) (4.66.1)
Building wheels for collected packages: culingam
  Building wheel for culingam (pyproject.toml) ... done
  Created wheel for culingam: filename=culingam-0.0.7-cp310-cp310-linux_x86_64.whl size=89289 sha256=b56e51c13260bece05ff0a9e4f17f81bc52f0c503ddb8bff87ddd669f0ab9eba
  Stored in directory: /root/.cache/pip/wheels/4d/90/ee/7192c3880f1d0903b6f0a50af63669c5b4f55107f44f120e78
Successfully built culingam
Installing collected packages: culingam
Successfully installed culingam-0.0.7

import numpy as np
import subprocess

# [[ 0.          0.          0.          2.99982982  0.          0.        ]
#  [ 2.99997222  0.          2.00008518  0.          0.          0.        ]
#  [ 0.          0.          0.          5.99981965  0.          0.        ]
#  [ 0.          0.          0.          0.          0.          0.        ]
#  [ 7.99857006  0.         -0.99911522  0.          0.          0.        ]
#  [ 3.99974733  0.          0.          0.          0.          0.        ]]
# [3, 0, 2, 5, 4, 1]

def get_cuda_version():
    try:
        nvcc_version = subprocess.check_output(["nvcc", "--version"]).decode('utf-8')
        print("CUDA Version found:\n", nvcc_version)
        return True
    except Exception as e:
        print("CUDA not found or nvcc not in PATH:", e)
        return False

def main():
    np.random.seed(42)
    size = 100000
    x3 = np.random.uniform(size=size)
    x0 = 3.0*x3 + np.random.uniform(size=size)
    x2 = 6.0*x3 + np.random.uniform(size=size)
    x1 = 3.0*x0 + 2.0*x2 + np.random.uniform(size=size)
    x5 = 4.0*x0 + np.random.uniform(size=size)
    x4 = 8.0*x0 - 1.0*x2 + np.random.uniform(size=size)

    X = np.array([x0, x1, x2, x3, x4, x5]).T

    dlm = DirectLiNGAM(12)
    dlm.fit(X, disable_tqdm=False)

    np.set_printoptions(precision=3, suppress=True)

    print(dlm._adjacency_matrix)
    print(dlm.causal_order_)

# Check for cuda availability before importing CUDA-dependent packages
if get_cuda_version():
    try:
        from culingam.directlingam import DirectLiNGAM
        main()

    except ImportError as e:
        print("Failed to import CUDA-dependent package:", e)
else:
    print("CUDA is not available. Please ensure CUDA is installed and correctly configured.")

CUDA Version found:
 nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

100%|██████████| 6/6 [00:00<00:00, 17.03it/s]
[[ 0.     0.     0.     0.     0.     0.   ]
 [ 6.596  0.     0.     0.     0.     0.   ]
 [-1.331  0.474  0.     0.     0.     0.   ]
 [ 0.065  0.     0.131  0.     0.     0.   ]
 [ 8.     0.    -1.     0.     0.     0.   ]
 [ 3.999  0.     0.     0.     0.     0.   ]]
[0, 1, 2, 3, 4, 5]

aknvictor commented 4 months ago

Thanks for your patience! It seems I needed to support a broader range of CUDA GPU compute capabilities; e.g. the P100 on Kaggle is sm_60. I've updated the package on PyPI and on GitHub. Let me know if that works.
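For anyone building from source, targeting several compute capabilities usually comes down to passing one `-gencode` flag per architecture to nvcc. A minimal sketch of generating those flags follows; `gencode_flags` is a hypothetical helper, and the list of architectures is illustrative only, since the set culingam actually targets is not stated in this thread.

```python
def gencode_flags(compute_capabilities):
    """Build nvcc -gencode flags, one per compute capability such as "6.0"."""
    flags = []
    for cc in compute_capabilities:
        arch = cc.replace(".", "")  # "6.0" -> "60"
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return flags


# Illustrative list only: 6.0 covers the P100 mentioned above.
print(gencode_flags(["6.0", "7.5", "8.6"]))
```

A kernel compiled only for a newer architecture has no binary that an sm_60 device can run, which would be consistent with the incorrect results seen on Kaggle before the update.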

ikeuchi-screen commented 4 months ago

@Viktour19 Thanks for responding! Both the PyPI and GitHub versions worked fine! I'll run a few more checks before merging the code.

It would be great if you could support pip install culingam to install on Windows as well!

ikeuchi-screen commented 4 months ago

@Viktour19 You said the GPU was 32 times faster than the CPU. What number of variables and sample size did you use? I tried the following combinations and found no difference between CPU and GPU:

Number of variables: {10, 20, 50, 100}
Sample size: {1000, 2000, 5000}

aknvictor commented 3 months ago

I benchmarked with samples: [1k to 1M] and dim: [10 to 100].

Here's the wall clock time for GPU on my setup. Can you share yours? How does this compare with CPU time on your setup?

PS: I'm working on getting a Windows machine to test on.

ikeuchi-screen commented 3 months ago

I fixed the number of variables to 100 based on the heatmap you showed me. There was no difference when the sample size was less than 5000, but when the sample size was greater than that, the GPU was clearly faster!
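A comparison like this can be reproduced with a small timing harness. The sketch below is generic and uses only the standard library; `time_fits` is a hypothetical helper, and the commented wiring assumes lingam, numpy and a working culingam install.

```python
import time


def time_fits(fitters, datasets):
    """Time each fitter on each dataset and return wall-clock seconds.

    fitters:  dict mapping a label to a callable taking one data array
    datasets: dict mapping a sample size to the data for that size
    """
    results = {}
    for n_samples, data in datasets.items():
        for label, fit in fitters.items():
            start = time.perf_counter()
            fit(data)
            results[(label, n_samples)] = time.perf_counter() - start
    return results


# Example wiring (not executed here):
# fitters = {
#     "cpu": lambda X: lingam.DirectLiNGAM().fit(X),
#     "gpu": lambda X: lingam.DirectLiNGAM(measure="pwling_fast").fit(X),
# }
# datasets = {n: np.random.uniform(size=(n, 100)) for n in (1000, 5000, 50000)}
# print(time_fits(fitters, datasets))
```

The first GPU call may include one-off initialisation cost, so a warm-up run before timing is probably worth adding.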

aknvictor commented 3 months ago

Excellent!

ikeuchi-screen commented 3 months ago

I temporarily reverted the merge because the CI tests and the docs build fail in an environment without culingam installed.

The error is due to the following code (direct_lingam.py):

from lingam_cuda import causal_order as causal_order_gpu

Installing culingam avoids the error. However, culingam cannot be installed without CUDA (and cannot be pip-installed on Windows), so this import would effectively make CUDA a requirement for using lingam.

ikeuchi-screen commented 3 months ago

Changed the import location and reverted again. https://github.com/cdt15/lingam/pull/133/commits/e64892b165823249db902fa3ca20edbde3ecd346
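The general pattern for keeping a dependency optional is to import it only inside the code path that needs it. The sketch below illustrates that idea; it is not the code from the commit above, and `load_optional` and `pick_causal_order` are hypothetical names.

```python
import importlib


def load_optional(module_name, attr, hint):
    """Import `attr` from `module_name` only when it is actually needed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this option: {hint}"
        ) from exc
    return getattr(module, attr)


def pick_causal_order(measure):
    # Only the GPU branch touches the optional dependency, so importing the
    # package and using the CPU measures works without CUDA.
    if measure == "pwling_fast":
        return load_optional("lingam_cuda", "causal_order", "pip install culingam")
    return None  # placeholder for the existing CPU code path
```

With this layout, CI and the docs build never execute the import unless the GPU measure is requested.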
