Hello Edward!
I am glad to see your message. Upon reviewing the program, I have identified some issues.
Firstly, you did not use the original code from GitHub for testing, but rather used the code from Zenodo.
PanguLU is inherently a distributed heterogeneous direct solver. The code on Zenodo is primarily intended to reproduce the performance of our experiments on a large-scale distributed computing machine.
If you have more GPUs, for example eight A100 40GB GPUs, the computation times on the individual GPUs, summed together, will match the experimental performance in our paper.
Regarding the Zenodo code you pointed out: our earlier tests revealed that, for some matrices, running on a small number of A100 40GB GPUs caused out-of-memory errors. To avoid this, we added new checks to the pangulu_preprocess.h file that are not in the original GitHub code.
These checks cause PanguLU to allocate only a very small segment of GPU memory initially, keeping the required computation space in CPU memory and transferring it to the GPU when needed. This incurs additional CPU-GPU transfer overhead, which leads to the situation you described. The code is as follows:
GPU_MEMORY_FLAG = 0;
if (sum_rank_size == 1) {
    // One process: stage through CPU memory if more than 8 GB would be needed on the GPU
    int_t Big_GPU_Memory = ((int_t)1000 * 1000 * 1000) * 8;
    if (CPU_MEMORY > Big_GPU_Memory) {
        GPU_MEMORY_FLAG = 1;
    }
}
else if (sum_rank_size <= 2) {
    // Two processes: the per-process threshold is 4 GB
    int_t Big_GPU_Memory = ((int_t)1000 * 1000 * 1000) * 4;
    if (CPU_MEMORY > Big_GPU_Memory) {
        GPU_MEMORY_FLAG = 1;
    }
}
else if (sum_rank_size <= 4) {
    // Up to four processes: stage only for large problems needing more than 20 GB
    int_t Big_GPU_Memory = ((int_t)1000 * 1000 * 1000) * 20;
    if (CPU_MEMORY > Big_GPU_Memory && (N >= 500000)) {
        GPU_MEMORY_FLAG = 1;
    }
}
If you use eight GPUs, this situation will not occur, so if you have more A100 cards available, I recommend using eight of them. This is also the easiest modification to make.
Additionally, a single A100 80GB GPU can also be used for testing. You only need to delete this segment of code and force GPU_MEMORY_FLAG to 0, as shown below, to ensure that the performance matches what is described in our paper:
GPU_MEMORY_FLAG = 0;
/*
if (sum_rank_size == 1) {
    int_t Big_GPU_Memory = ((int_t)1000 * 1000 * 1000) * 8;
    if (CPU_MEMORY > Big_GPU_Memory) {
        GPU_MEMORY_FLAG = 1;
    }
}
else if (sum_rank_size <= 2) {
    int_t Big_GPU_Memory = ((int_t)1000 * 1000 * 1000) * 4;
    if (CPU_MEMORY > Big_GPU_Memory) {
        GPU_MEMORY_FLAG = 1;
    }
}
else if (sum_rank_size <= 4) {
    int_t Big_GPU_Memory = ((int_t)1000 * 1000 * 1000) * 20;
    if (CPU_MEMORY > Big_GPU_Memory && (N >= 500000)) {
        GPU_MEMORY_FLAG = 1;
    }
}
*/
I also noticed some issues with your testing platform. You are using an Intel Xeon E5-2695, which has the following specifications:
Processor Number: E5-2695 v2
Number of Cores: 12
Number of Threads: 24
Base Frequency: 2.4 GHz
In our experiments, we used the Kunpeng 920 7265, with specifications as follows:
Processor Number: Kunpeng 920 7265
Number of Cores: 32
Number of Threads: 32
Base Frequency: 2.6 GHz
Additionally, the different machines result in different PCIe bandwidths.
The performance discrepancy in your experiments might be due to differences in CPU performance and PCIe data transfer bandwidth. You can replicate our experiments by testing the performance of each sparse BLAS kernel individually.
Theoretically, our sparse BLAS should not have issues, and the problem likely lies with the CPU-to-A100 GPU bandwidth.
PanguLU is designed for distributed computing systems. To simplify the design, even a single GPU requires data to be copied to the CPU after computation, which can cause severe bandwidth issues with only one GPU.
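If you want to verify this on your platform, a minimal standalone sketch like the one below (not part of PanguLU; the 256 MB buffer size and iteration count are arbitrary assumptions) measures the effective host-to-device and device-to-host bandwidth over PCIe:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = (size_t)256 * 1024 * 1024; /* 256 MB test buffer (arbitrary) */
    const int iters = 20;
    void *h_buf = NULL, *d_buf = NULL;
    cudaMallocHost(&h_buf, bytes); /* pinned host memory for peak transfer speed */
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    /* Host-to-device bandwidth */
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", (double)iters * bytes / ms / 1e6);

    /* Device-to-host bandwidth (the direction PanguLU pays for after each kernel) */
    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.2f GB/s\n", (double)iters * bytes / ms / 1e6);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

On a PCIe 3.0 x16 host such as an E5-2695 system, roughly 10 to 13 GB/s is typical in each direction, far below the on-device memory bandwidth the A100 kernels expect, which is consistent with the bottleneck described above.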
Thanks for your response! I want to test the code on a single GPU. I compiled the GitHub code successfully and also ran test.mtx successfully. But when I test other matrices in .mtx format, it gets stuck for a long time; the preprocessing takes a lot of time. I want to know what to configure to run on a single GPU. The program details are as follows:
mpirun -np 1 PanguLU -NB 512 -F Si2.mtx
MPI Processes 1 NB is 512
Matrix is MM/Si2.mtx
ADAPTIVE_KERNEL_SELECTION ------------OFF
SYNCHRONIZE_FREE ----------------------ON
N is 769 ,NNZ is 17801
Device NVIDIA A100 80GB PCIe
PanguLU the reorder time is 0.008938 s
Symbolic nonzero = 278407
PanguLU the symbolic time is 0.001503 s
PanguLU the preprocess time is 0.058928 s
I waited for a long time, and the program seems to be stuck. How should I set the parameters to run on a single GPU? Or is there somewhere I should edit? Thank you very much!
I carefully reviewed your output information and noticed some issues.
Firstly, your matrix dimension is 769. If the matrix were dense, the number of non-zero elements would be 769 * 769 = 591,361. Your output shows 278,407 non-zero elements after symbolic factorization, about 47% of fully dense, indicating that the matrix is nearly dense post-factorization.
Did you enable the Metis option during compilation? It is possible that you did not correctly use the Metis library. PanguLU requires the 64-bit Metis library to function properly.
Even assuming PanguLU does need to run on a nearly dense sparse matrix, I noticed that you did not enable the ADAPTIVE_KERNEL_SELECTION option for adaptive sparse BLAS. This means you are trying to run nearly dense matrix blocks with pure sparse BLAS kernels.
It is likely that the process is stalling during the numerical factorization phase because you are using sparse BLAS to solve a nearly dense block, which is particularly inefficient.
You can monitor the NVIDIA GPU utilization to check whether your GPU is constantly computing. Use the following command to check whether the A100's utilization remains around 100% while the code is running:
watch -n 0.1 nvidia-smi
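Alternatively, an equivalent loop that prints only the utilization and memory figures (using nvidia-smi's standard query options) is:

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1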
Therefore, I recommend the following modifications:
First, verify that Metis is properly linked. Detailed instructions for enabling Metis can be found in the README.
Note that the METIS version needs to be 5.0.2 or higher and built with 64-bit integer support.
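For reference, a typical way to produce such a build from the standard METIS 5.x source tree is shown below; the $HOME/local install prefix is an assumption chosen to match the paths in the make.inc shown later in this thread:

# In include/metis.h, set the index width to 64 bits before configuring:
#   #define IDXTYPEWIDTH 64
make config shared=1 cc=gcc prefix=$HOME/local
make
make install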
Next, I suggest enabling the ADAPTIVE_KERNEL_SELECTION option for adaptive sparse BLAS to help accelerate the computation. This can be done by modifying the code in pangulu_common.h as follows:
#define ADAPTIVE_KERNEL_SELECTION // Enable ADAPTIVE_KERNEL_SELECTION
#define SYNCHRONIZE_FREE
#define PANGULU_MC64
#define METIS
#define SYMBOLIC
#define PANGULU_SPTRSV
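After changing these macros, remember to do a full rebuild (for example, make clean && make) so that the new configuration takes effect in every compilation unit.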
Thanks for your help!
I'm sure that METIS is properly linked; the version is 5.1.0 and 64-bit. When I compiled PanguLU, I ran into some trouble linking METIS: it could not link against GKlib. I searched for answers on GitHub and finally solved the problem.
Then I enabled ADAPTIVE_KERNEL_SELECTION, but the preprocessing is still very slow. The GPU is at 100% utilization.
I think NB should be set properly, but I don't know what value to use.
The matrix I tested, ASIC_680k.mtx, is cited from the paper. The detailed information is as follows:
mpirun -np 1 PanguLU -NB 256 -F matrix/ASIC_680k.mtx
MPI Processes 1 NB is 256
Matrix is matrix/ASIC_680k.mtx
ADAPTIVE_KERNEL_SELECTION -------------ON
SYNCHRONIZE_FREE ----------------------ON
N is 682862 ,NNZ is 3871773
Device NVIDIA A100 80GB PCIe
PanguLU the reorder time is 3.725488 s
Symbolic nonzero = 116553596
PanguLU the symbolic time is 3.514559 s
PanguLU the preprocess time is 104.623883 s
I waited for about an hour; the GPU memory usage stays at 4753 MB and GPU utilization at 100%, neither of them changing. I think something must be stalling the program.
And this is my make.inc:
CUDA_PATH = /usr/local/cuda
LU_INC =-I../include -I$(HOME)/local/include
CUDA_INC = -I$(CUDA_PATH)/include
CUDA_LIB = -L$(CUDA_PATH)/lib64 -lcudart -lcusparse
# METIS_INC=-I
MPI_ROOT=/usr/mpi/gcc/openmpi-4.1.7a1
NVCC = nvcc
CC = gcc
CPP = g++
MPICPP = mpic++
CPPFLAGS = -fopenmp -O3
MPICPPFLAGS = -O3 -Wall -std=c++11 -lm -lpthread -fopenmp $(LU_INC) $(CUDA_INC)
NVCCFLAGS = -O3 -w -Xptxas -dlcm=cg -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61 $(LU_INC) $(CUDA_INC) $(CUDA_LIB)
MPICPPLINK = -L/$(MPI_ROOT)/lib
METISFLAGS = -L$(HOME)/local/lib/ -lmetis
When I change compute_61 in NVCCFLAGS to compute_80 (changing all occurrences of 61 to 80), there is an error:
nvcc -O3 -w -Xptxas -dlcm=cg -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -I../include -I/staff/wangchao/local/include -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -lcusparse -Xcompiler -fPIC -c pangulu_cuda.cu -o pangulu_cuda.o
pangulu_cuda.cu(533): error: identifier "__shfl" is undefined
global_x_id = __shfl(global_x_id, 0);
^
1 error detected in the compilation of "pangulu_cuda.cu".
In your response, I noticed some interesting points.
You mentioned changing compute_61 to compute_80 in NVCCFLAGS; the errors might be caused by the CUDA version. PanguLU requires CUDA 11.3.0, and it is possible that a higher CUDA version causes some shfl instructions to fail, leading to deadlocks within the program. This could be why your program isn't running correctly.
Here are some other versions you might want to consider: GCC 9.3.0; OpenMPI-4.1.2; CUDA 11.3.0; metis-5.0.2.
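As a side note (a suggestion rather than an official patch): the legacy __shfl intrinsic is not available when compiling for compute capability 7.0 and above, which is why the sm_80 build fails. If you prefer to stay on a newer toolkit, the usual replacement is the _sync variant, for example:

/* The legacy warp shuffle was removed for sm_70 and later; the _sync */
/* variant takes an explicit mask of participating lanes (here, the   */
/* full warp).                                                        */
global_x_id = __shfl_sync(0xFFFFFFFF, global_x_id, 0);

Every __shfl call in pangulu_cuda.cu would need the same treatment, and the synchronization semantics should be checked against the surrounding kernel logic.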
For the ASIC_680k.mtx matrix, an NB of 256 makes the blocks too small, leaving the GPU computation underutilized. You can try a larger NB in the range of 1000 to 5000, such as 1000, 1500, or 2000.
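For example, reusing your earlier command line with a larger block size:

mpirun -np 1 PanguLU -NB 2000 -F matrix/ASIC_680k.mtx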
The GPU I used is an NVIDIA A100 40GB, and the CPU is an Intel Xeon E5-2695 v4. I tested the program on a single GPU. These matrices are significantly slower than in the paper: ASIC_680k (3.62x slower), cage12 (10.18x slower), dielFilterV3real (3.38x slower), nlpkkt80 (2.01x slower). Here is a result for the cage12 matrix:
CUDA Runtime: 12030 CUDA Driver: 12000
MPI Processes 1
Matrix is ../matrix/cage12.tsv
ADAPTIVE_KERNEL_SELECTION -------------ON
SYNCHRONIZE_FREE ----------------------ON
N is 130228 ,NNZ is 2032536
the gpu num is 2 the rank is 0 use the 0 gpu
Device NVIDIA A100-PCIE-40GB
Symbolic L + U NNZ = 570274806
the symbolic time is 1.003155 s
../matrix/cage12_00200/
read ../matrix/cage12_00200/
rank 0 the preprocess time is 127.743160 s
rank 0 the numerical time is 578.705797 s 7.308713 GFLOPs
filename = ../matrix/cage12.tsv
matrix_name : cage12
rank  getrf    tstrf     gessm    ssssm      cal_sum
0     0.40429  10.30226  7.36425  541.90883  559.97963
pangulu_test---------------------------finish
Why are my test results so much slower than in the paper?