dmalhotra / pvfmm

A parallel kernel-independent FMM library for particle and volume potentials
http://pvfmm.org
GNU Lesser General Public License v3.0

Segmentation fault when compiled with cuda #3

Closed: wenyan4work closed this issue 7 years ago

wenyan4work commented 7 years ago

Hi, I am trying to compile your code and ran into a segmentation fault when it is built with CUDA.

Software: CentOS 7, mpicxx (OpenMPI 1.10.0 + gcc 4.8.5), CUDA 7.5, NVIDIA driver 367.35
Hardware: 2x Xeon E5-2643 v3, 128 GB memory
pvfmm was cloned from the GitHub repo.

When compiled for CPU only, the examples run smoothly. But when I configure it with CUDA, for example:

 ./configure MPICXX=/usr/lib64/openmpi/bin/mpicxx --prefix=/home_local/wyan_local/software/PVFMM/install --with-cuda=/usr/local/cuda 

the examples throw a segmentation fault. For example, with 1 OpenMP thread:

example1 -N 512

gives

        W-List {
        }
        U-List {
        }
        V-List {
        }
        D2H_Wait:LocExp {
Segmentation fault (core dumped)

I looked at the code a bit, and it seems the loop that copies dev_ptr to host_ptr at line 681 in fmm_tree.txx is what triggers the segmentation fault:

  Profile::Tic("D2H_Wait:LocExp",this->Comm(),false,5);
  if(device) if(setup_data[0+MAX_DEPTH*2].output_data!=NULL){
    Real_t* dev_ptr=(Real_t*)&fmm_mat->staging_buffer[0];
    Matrix<Real_t>& output_data=*setup_data[0+MAX_DEPTH*2].output_data;
    size_t n=output_data.Dim(0)*output_data.Dim(1);
    Real_t* host_ptr=output_data[0];
    output_data.Device2HostWait();

    #pragma omp parallel for
    for(size_t i=0;i<n;i++){
      host_ptr[i]+=dev_ptr[i];
    }
  }
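
To make sure I understand the failure mode, here is a small standalone sketch (not pvfmm code; the names and sizes below are made up for illustration) of what I think is happening: the loop reads n elements through dev_ptr, so if the buffer behind dev_ptr was never sized for the particle data, the accumulation walks past the end of the allocation. In this toy version a bounds check avoids the crash:

    // Standalone sketch of the suspected out-of-bounds read (hypothetical
    // names and sizes; this is not the pvfmm source).
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    typedef double Real_t;

    int main() {
      const size_t n = 1 << 20;             // number of entries in output_data
      std::vector<Real_t> output(n, 0.0);   // stands in for output_data
      Real_t* host_ptr = &output[0];

      // Suppose nothing was staged for the particle FMM, so the buffer
      // backing dev_ptr is empty (or much smaller than n).
      std::vector<Real_t> staging;
      Real_t* dev_ptr = staging.empty() ? NULL : &staging[0];

      if (dev_ptr == NULL || staging.size() < n) {
        // A guard like this skips the accumulation instead of reading past
        // the end of the buffer, which is where I suspect the crash comes from.
        std::printf("staging buffer too small, skipping D2H accumulation\n");
        return 0;
      }

      #pragma omp parallel for
      for (size_t i = 0; i < n; i++) host_ptr[i] += dev_ptr[i];
      return 0;
    }

This is only a guess at the mechanism on my part.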

I have tried moving from OpenMPI 1.10 to 2.0 (the latest), and to the latest MPICH. I have also configured pvfmm with different gcc/nvcc compilation flags, from '-g -O0' to '-O2', as well as '-mtune=native' and '-gencode arch=compute_52,code=sm_52'. All of the combinations give the same segmentation fault.

Could you please help me locate the problem?

Thank you,

wenyan4work commented 7 years ago

This is the compiler output from building the examples, a debug build with '-g -O2 -std=gnu++11 -fopenmp':


pvfmm/ $ make all-examples
cd ./examples && make;
make[1]: Entering directory `/home_local/wyan_local/software/PVFMM/pvfmm/examples'
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include -I./include -c src/example1.cpp -o obj/example1.o
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include obj/example1.o -L/home_local/wyan_local/software/PVFMM/pvfmm/lib -lpvfmm -lfftw3  -lfftw3f -lopenblas  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../.. -ldl -lstdc++ -lgfortran -lm -lquadmath -lX11 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -ldl -lstdc++ -lm  -o bin/example1
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include -I./include -c src/example2.cpp -o obj/example2.o
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include obj/example2.o -L/home_local/wyan_local/software/PVFMM/pvfmm/lib -lpvfmm -lfftw3  -lfftw3f -lopenblas  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../.. -ldl -lstdc++ -lgfortran -lm -lquadmath -lX11 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -ldl -lstdc++ -lm  -o bin/example2
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include -I./include -c src/fmm_pts.cpp -o obj/fmm_pts.o
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include obj/fmm_pts.o -L/home_local/wyan_local/software/PVFMM/pvfmm/lib -lpvfmm -lfftw3  -lfftw3f -lopenblas  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../.. -ldl -lstdc++ -lgfortran -lm -lquadmath -lX11 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -ldl -lstdc++ -lm  -o bin/fmm_pts
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include -I./include -c src/fmm_cheb.cpp -o obj/fmm_cheb.o
/usr/lib64/openmpi/bin/mpicxx -g -O2 -std=gnu++11 -fopenmp -DALLTOALLV_FIX -I/home_local/wyan_local/software/PVFMM/pvfmm/include   -I/usr/local/cuda/include obj/fmm_cheb.o -L/home_local/wyan_local/software/PVFMM/pvfmm/lib -lpvfmm -lfftw3  -lfftw3f -lopenblas  -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../.. -ldl -lstdc++ -lgfortran -lm -lquadmath -lX11 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -ldl -lstdc++ -lm  -o bin/fmm_cheb
rm obj/example1.o obj/fmm_pts.o obj/example2.o obj/fmm_cheb.o
make[1]: Leaving directory `/home_local/wyan_local/software/PVFMM/pvfmm/examples'
pvfmm/ $ ./examples/bin/example1 -N 512
InitTree {
    InitRoot {
    }
    Points2Octee {
    }
    ScatterPoints {
    }
    PointerTree {
    }
    InitFMMData {
    }
}
InitFMM_Tree {
    RefineTree {
    }
    2:1Balance {
Balance Octree. inpSize: 8 tmpSize: 1 outSize: 8 activeNpes: 1
    }
}
InitFMM_Pts {
    LoadMatrices {
        ReadFile {
        }
        Broadcast {
        }
    }
    PrecompUC2UE {
    }
    PrecompDC2DE {
    }
    PrecompBC {
    }
    PrecompU2U {
    }
    PrecompD2D {
    }
    Save2File {
    }
    PrecompV {
    }
    PrecompV1 {
    }
}
SetupFMM {
    ConstructLET {
    }
    SetColleagues {
    }
    CollectNodeData {
    }
    BuildLists {
    }
    UListSetup {
    }
    WListSetup {
    }
    XListSetup {
    }
    VListSetup {
    }
    D2DSetup {
    }
    D2TSetup {
    }
    S2USetup {
    }
    U2USetup {
    }
    ClearFMMData {
    }
}
RunFMM {
    UpwardPass {
        S2U {
        }
        U2U {
        }
    }
    ReduceBcast {
    }
    DownwardPass {
        Setup {
        }
        Host2Device:Src {
        }
        X-List {
        }
        Host2Device:Mult {
        }
        Device2Host:LocExp {
        }
        W-List {
        }
        U-List {
        }
        V-List {
        }
        D2H_Wait:LocExp {
Segmentation fault (core dumped)
pvfmm/ $ 
dmalhotra commented 7 years ago

The particle FMM does not support GPU acceleration right now. We only have accelerator support for computing volume integrals. If you are not using the volume FMM, then for now you should compile without CUDA.
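
For example, the configure line from your report with the --with-cuda option dropped should give a working CPU-only build:

    ./configure MPICXX=/usr/lib64/openmpi/bin/mpicxx --prefix=/home_local/wyan_local/software/PVFMM/install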

I am working on fixing the segmentation fault, but we don't have plans to add GPU acceleration to the particle code anytime soon.

dmalhotra commented 7 years ago

The latest commit has fixed this bug.

wenyan4work commented 7 years ago

Thank you for the fast fix! I tested it on my machine and it works now.

May I ask you for some information about the PKIFMM code in your group? That code also performs particle KIFMM but depends on some old packages. Which one do you think is better for particle FMM on MPI clusters?

dmalhotra commented 7 years ago

The original PKIFMM code has some GPU support but it is not being actively maintained. As you have mentioned, it requires some old packages and can be difficult to install.

The new PVFMM code has an optimized V-list algorithm and vectorized AVX implementations of several kernel functions. Running only on the CPU, PVFMM is significantly faster (about 3x). The GPU implementation of PKIFMM may have similar performance to PVFMM (on CPU), but I haven't actually compared. The MPI algorithms are similar for both codes and should have similar scalability. I would recommend using PVFMM.

wenyan4work commented 7 years ago

Thanks a lot!