davidrohr / hpl-gpu

High Performance Linpack for GPUs (Using OpenCL, CUDA, CAL)

undefined reference to `fatbinData' #2

Closed · kaiattrib closed this issue 6 years ago

kaiattrib commented 6 years ago

After compiling caldgemm successfully, I get a library link error when compiling HPL-GPU.

Log as follows:

```
-rpath=~/hpl-gpu/lib -ldl -L/root/cuda-8.0/lib64 -lcudart -lcudadevrt -lcublas -L ~/softwares/software_install/OpenMPI/lib64 -lmpi -lmpi_cxx
/tmp/ccSp3tGD.ltrans28.ltrans.o:(.nvFatBinSegment+0x8): undefined reference to `fatbinData'
collect2: error: ld returned 1 exit status
make[2]: *** [dexe.grd] Error 1
```

Environment: MKL, CUDA 8.0, OpenMPI, CentOS 7

I get the same error with CUDA 8 and CUDA 9.

Where am I going wrong? Can you give me some advice?

davidrohr commented 6 years ago

Hi,

this looks like a problem with link-time optimization (LTO). Please try disabling CONFIG_LTO in caldgemm/config_options.mak and/or HPL_USE_LTO in hpl/Make.Generic.Options.

Cheers David

kaiattrib commented 6 years ago

Thanks a lot! Sorry to have troubled you, @davidrohr.

I have run into a strange problem: when running the HPL-GPU test, only the CPU is computing, even though the GPU processes are allocated successfully.

Watching with `watch -n 1 nvidia-smi`, I always get output like the one below: low GPU power draw and 0% GPU utilization. The final benchmark result is at CPU level, not GPU level, which indicates the GPU is not being used at all. Changing the GPU ratio in HPL-GPU.conf gives the same result.

Why?

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    32W / 250W |   1205MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:84:00.0 Off |                    0 |
| N/A   33C    P0    31W / 250W |   1205MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

Additionally, when NB is larger than 512 in HPL.dat, I get an error like this:

```
CUDA Error 33: invalid resource handle caldgemm_cuda.cu:574 Recording simpleQueueEvent event 0 0
Error in CALDGEMM Run, aborting HPL Run
```

https://github.com/davidrohr/caldgemm/blob/master/caldgemm_cuda.cu#L574

davidrohr commented 6 years ago

Hi,

for such a small NB, HPL will not use the GPU, since the run would be dominated by the transfers. A larger NB should run on the GPU, but that seems to fail in your case, and from that message I cannot judge why. You can try enabling the debug output in the generic settings file.

kaiattrib commented 6 years ago

Thanks very much! After enabling it, I get the following log when NB is larger than 512 in HPL.dat.

Do you have any tips?

```
An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   40000
NB     :    1152
PMAP   : Column-major process mapping
P      :       1
Q      :       1
PFACT  :   Crout
NBMIN  :      64
NDIV   :       2
RFACT  :    Left
BCAST  :   MPI
DEPTH  :       2
SWAP   : Spread-roll (long)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words
SEED   : 100
Config : -I/home/user002/chenkai/hpl-gpu/include -I/home/user002/chenkai/hpl-gpu/include/Generic -isystem /home/user002/chenkai/hpl-gpu/caldgemm -isystem /home/user002/intel/mkl/include -isystem /usr/local/cuda/include -isystem /home/user002/intel_2017/impi/2017.4.256/include64 -DHPL_PAGELOCKED_MEM -DHPL_INTERLEAVE_C -DHPL_REGISTER_MEMORY -DHPL_HUGE_TABLES -DHPL_FASTINIT -DHPL_FASTVERIFY -DHPL_FAST_GPU -DHPL_COPY_L -DHPL_COPYL_DURING_FACT -DHPL_ASYNC_DLATCPY -DHPL_NO_MPI_THREAD_CHECK -DHPL_MPI_WRAPPERS -DHPL_NO_MPI_DATATYPE -DHPL_GPU_PERFORMANCE_WARNINGS -DHPL_PRINT_CONFIG -DHPL_WARMUP -DHPL_GPU_RUNTIME_CONFIG -DHPL_LASWP_AVX -DHPL_CALDGEMM_BACKEND=cuda -DUSE_MKL -DCALDGEMM_CUDA_CUBLAS -DHPL_LOOKAHEAD_2B -DHPL_PRINT_INTERMEDIATE -DHPL_GPU_NOT_QUIET -DHPL_DETAILED_TIMING -DCALDGEMM_TEST -DHPL_DETAILED2_TIMING -DHPL_GPU_EXTRA_CALDGEMM_OPTIONS=cal_info.DeviceNum = -1;cal_info.NumaPinning = 0;cal_info.CPUInContext = 0;cal_info.Use3rdPartyTranspose = true;cal_info.GPU_C = 1;cal_info.SimpleGPUQueuing = true;cal_info.ImprovedScheduler = true; cal_info.ImprovedSchedulerBalance = 2;cal_info.DstMemory = 'g';cal_info.DstMemory = 'g';cal_info.KeepBuffersMapped = false;cal_info.AsyncSideQueueBalance = 1;

Runtime Option "HPL_MPI_AFFINITY", Parameter "1"
Runtime Option "HPL_NUM_LASWP_CORES", Parameter "14"
Runtime Option "HPL_WARMUP", Parameter "enabled"
Parsing CALDGEMM Runtime Command Options: -j 1.0 -Ca 100000
AlternateLookahead: 0 changed to 100000
Initializing CALDGEMM (CUDA Runtime)
Running on 1 devices with 30 bbuffers (gpu03)
Using 14 threads for LASWP ( 1 3 4 5 6 7 8 9 10 11 12 13 14 15 )
(Problem: N 40000 NB 1152)(Network: BCAST 407 LOOKAHEAD 2) (Factorization: NBMIN 64 NBDIV 2 PFACT 302 RFACT 301)
Process col 0 processes 35 matrix cols

Running warmup iteration
Iteration j=0 N=40000 n=40000 jb=1152 Totaltime=0.000
Timer RPFACT (11) CPU Time 4.51411 Wall Time 0.72264
Timer BCAST (20) CPU Time 0.00000 Wall Time 0.00001
Starting DGEMM Run m=38849 k=1152 n=38848 Alpha=-1.000000 Beta=1.000000 LDA=0x9c48 LDB=0x9c48 LDC=0x9c48 At=0 Bt=0 ColMajor=1 (A=0x1029bda2000, B=0x10285e02400, C=0x1029bda4400, (C-A=1152, (C-B)/w=40008), Linpack=2)
Ratio 1.000000 - gpu_m 38848 gpu_n 38848 - Split m Favor m - Height 3072 (/ 4096), Min Tiling 32 (1984, 1984)
Slave thread starting cblas (m: 38849, n: 38848, cblas_size: 1 (3072), dynamic: 0/0, cpu_k: 169)
Timer LASWP (15) CPU Time 0.16288 Wall Time 0.01196
Timer DTRSM (18) CPU Time 0.13555 Wall Time 0.01093
CUDA Error 0: no error caldgemm_cuda.cu:556 CUDA Conversion Kernel Execution
Error in CALDGEMM Run, aborting HPL Run
Timer LASWP (15) CPU Time 0.51475 Wall Time 0.03730
```

The error is at https://github.com/davidrohr/caldgemm/blob/master/caldgemm_cuda.cu#L556

davidrohr commented 6 years ago

Hi,

I just figured out that there is a bug in the debug output code. Could you update caldgemm to the master branch and rerun?

Anyway, most likely the issue you are facing is that your GPU architecture level is not defined in the caldgemm config.mak (CUDAVERSION setting). I have added Pascal GPUs (level 61) to the defaults. If you need another level, please add it there.
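For reference, a quick standalone helper (just a sketch, not part of caldgemm or HPL-GPU) that queries each device's compute capability through the CUDA runtime API; the major and minor digits combined give the architecture level (e.g. 6.0 → 60, 6.1 → 61):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print each device's compute capability; major*10 + minor is the usual
// "architecture level" notation (e.g. 6.0 -> 60, 6.1 -> 61).
int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d (level %d)\n",
               i, prop.name, prop.major, prop.minor,
               prop.major * 10 + prop.minor);
    }
    return 0;
}
```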

Cheers David

kaiattrib commented 6 years ago

Thank you!

I set CUDAVERSION=61 in config.mak and recompiled successfully, but still got an error: CUDA Error 8: invalid device function.

I googled that the P100 level is 60. After setting CUDAVERSION=60, I got a new error:

```
CUDA Error 33: invalid resource handle caldgemm_cuda.cu:575 Recording simpleQueueEvent event 0 0
Error in CALDGEMM Run, aborting HPL Run
```

https://github.com/davidrohr/caldgemm/blob/master/caldgemm_cuda.cu#L575

I googled the error. cudaErrorInvalidResourceHandle: "This indicates that a resource handle passed to the API call was not valid. Resource handles are opaque types like cudaStream_t and cudaEvent_t." (link)
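For what it is worth, here is a minimal hypothetical sketch (not the caldgemm code) of one common way this error shows up in multi-GPU programs: an event created while one device is current gets recorded into a stream that belongs to a different device.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical illustration: recording an event tied to device 0 into a
// stream tied to device 1 is a typical way to provoke
// cudaErrorInvalidResourceHandle in multi-GPU code.
int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) {
        printf("This illustration needs two GPUs.\n");
        return 0;
    }

    cudaSetDevice(0);
    cudaEvent_t ev;
    cudaEventCreate(&ev);      // event belongs to device 0

    cudaSetDevice(1);
    cudaStream_t st;
    cudaStreamCreate(&st);     // stream belongs to device 1

    cudaError_t err = cudaEventRecord(ev, st);  // device mismatch
    printf("cudaEventRecord: %d (%s)\n", (int)err, cudaGetErrorString(err));

    cudaStreamDestroy(st);
    cudaSetDevice(0);
    cudaEventDestroy(ev);
    return 0;
}
```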

Do you have any idea?

davidrohr commented 6 years ago

Can you run the following in the caldgemm directory: `./dgemmbench -O 2 - -e`

It should do a quick test of the CUDA dgemm and verify the matrix multiplication result.
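As an aside, a minimal standalone cross-check (independent of caldgemm/dgemmbench; plain cuBLAS, assuming CUDA and cuBLAS are installed) that runs a small DGEMM on the GPU and verifies it against a naive CPU loop could look roughly like this:

```cpp
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Small DGEMM sanity check: C = alpha*A*B + beta*C on the GPU via cuBLAS,
// compared element-wise against a naive column-major CPU reference.
int main() {
    const int n = 128;
    const double alpha = -1.0, beta = 1.0;
    std::vector<double> a(n * n), b(n * n), c(n * n), ref(n * n);
    for (int i = 0; i < n * n; ++i) {
        a[i] = (i % 7) * 0.5;
        b[i] = (i % 5) * 0.25;
        c[i] = ref[i] = 1.0;
    }

    double *da, *db, *dc;
    cudaMalloc(&da, n * n * sizeof(double));
    cudaMalloc(&db, n * n * sizeof(double));
    cudaMalloc(&dc, n * n * sizeof(double));
    cudaMemcpy(da, a.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, c.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, da, n, db, n, &beta, dc, n);
    cudaMemcpy(c.data(), dc, n * n * sizeof(double), cudaMemcpyDeviceToHost);

    // Naive CPU reference in column-major order.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) s += a[i + k * n] * b[k + j * n];
            ref[i + j * n] = alpha * s + beta * ref[i + j * n];
        }

    double maxerr = 0.0;
    for (int i = 0; i < n * n; ++i) maxerr = fmax(maxerr, fabs(c[i] - ref[i]));
    printf("max abs error: %g -> %s\n", maxerr, maxerr < 1e-9 ? "Passed" : "FAILED");

    cublasDestroy(handle);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Compile with something like `nvcc -o dgemm_check dgemm_check.cu -lcublas`.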

kaiattrib commented 6 years ago

The log is as follows:

```
Use -? for help
Initializing CALDGEMM (CUDA Runtime)
Cannot use multiple devices without multithreading
Running on 1 devices with 30 bbuffers (gpu06)
Initializing Data... ...alloc A (33024 KB) B (32832 KB) C (131328 KB)......init A...init B...Done
Doing initial run... Done
Initializing Matrix C
Running Benchmark
Starting DGEMM Run m=4096 k=1024 n=4096 Alpha=-1.000000 Beta=1.000000 LDA=0x408 LDB=0x1008 LDC=0x1008 At=0 Bt=0 ColMajor=0 (A=0x2acb21000000, B=0x2acb23040000, C=0x2acb25050000, (C-A=8429568, (C-B)/w=4104), Linpack=0)
Ratio 0.000000 - gpu_m 4096 gpu_n 4096 - Split n Favor m - Height 4096 (/ 4096), Min Tiling 4096 (0, 0)
Program: caldgemm Sizes - A: 4096x1024 B: 1024x4096 C:4096x4096 (Host: gpu06) System Time 0.024 System Gflops 1460.519
Verifying results can take a long time on large matrices.
CPU Time: 1.501743 Gflops: 22.879908
Passed!
```

kaiattrib commented 6 years ago

On a single node with a single GPU, it works perfectly.

But on a single node with two GPUs, there is an error; I find that each xhpl process uses both GPUs.

For example, with `mpirun -np 1 ./xhpl`:

```
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5202      C   ./xhpl                                      1171MiB |
|    1      5202      C   ./xhpl                                      1171MiB |
+-----------------------------------------------------------------------------+
```

Is single-node multi-GPU or multi-node multi-GPU supported?

davidrohr commented 6 years ago

Multi-node with multi-GPU should work (I am sure it does for OpenCL). But the CUDA version was never tested much, since we never had a CUDA cluster to run on. In fact, I am not sure whether the CUDA version was ever tested multi-node.

kaiattrib commented 6 years ago

Thanks!

With OpenCL I also get an error:

```
Initializing CALDGEMM (OpenCL Runtime)
Running on 2 devices with 30 bbuffers (gpu06)
Using 14 threads for LASWP ( 1 3 4 5 6 7 8 9 10 11 12 13 14 15 )
(Problem: N 8000 NB 512)(Network: BCAST 407 LOOKAHEAD 2) (Factorization: NBMIN 64 NBDIV 2 PFACT 302 RFACT 301)
Process col 0 processes 16 matrix cols
Allocating memory: 516859016 bytes...01Error allocating memory (clEnqueueMapBuffer) (0: Success!)

HPL ERROR from process # 0, on line 242 of function HPL_pdtest:
[0,0] Memory allocation failed for A, x and b. Skip.

Finished 1 tests with the following results:
0 tests completed and passed residual checks,
0 tests completed and failed residual checks,
1 tests skipped because of illegal input values.

End of tests.
```

Can the single-node, dual-GPU case be made to work?

davidrohr commented 6 years ago

Hi, it might be that OpenCL only works with AMD GPUs; I never tried it with NVIDIA.

Single-node multi-GPU with NVIDIA should work.

Kind regards, David Rohr

(Sent from my mobile, excuse the typos.)


kaiattrib commented 6 years ago

Thanks!

I have tried many times, all on a single node with two GPUs.

When I export CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1 to use a single GPU, it works perfectly on either one.

But when I export CUDA_VISIBLE_DEVICES=0,1 to use both GPUs, it fails like this:

```
Initializing CALDGEMM (CUDA Runtime)
Running on 2 devices with 30 bbuffers (gpu06)
Using 14 threads for LASWP ( 1 3 4 5 6 7 8 9 10 11 12 13 14 15 )
(Problem: N 40000 NB 1920)(Network: BCAST 407 LOOKAHEAD 2) (Factorization: NBMIN 64 NBDIV 2 PFACT 302 RFACT 301)
Process col 0 processes 21 matrix cols
Allocating memory: 12862231688 bytes...

Running warmup iteration
Iteration j=0 N=40000 n=40000 jb=1920 Totaltime=0.000
Timer RPFACT (11) CPU Time 28.72164 Wall Time 1.93543
Timer BCAST (20) CPU Time 0.00065 Wall Time 0.00006
Starting DGEMM Run m=38081 k=1920 n=38080 Alpha=-1.000000 Beta=1.000000 LDA=0x9c48 LDB=0x9c48 LDC=0x9c48 At=0 Bt=0 ColMajor=1 (A=0x2b0890a0e000, B=0x2b086c003c00, C=0x2b0890a11c00, (C-A=1920, (C-B)/w=40008), Linpack=2)
Ratio 1.000000 - gpu_m 38080 gpu_n 38080 - Split m Favor m - Height 3072 (/ 4096), Min Tiling 32 (1216, 1216)
Slave thread starting cblas (m: 38081, n: 38080, cblas_size: 1 (3072), dynamic: 0/0, cpu_k: 169)
Timer LASWP (15) CPU Time 0.55227 Wall Time 0.03319
Timer DTRSM (18) CPU Time 0.53059 Wall Time 0.02596
CUDA Error 33: invalid resource handle caldgemm_cuda.cu:575 Recording simpleQueueEvent event 0 0
Error in CALDGEMM Run, aborting HPL Run
Segmentation fault (core dumped)
```