StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

A problem about single node with multi-gpu #565

Closed Clinuxnewbie closed 5 years ago

Clinuxnewbie commented 5 years ago

When I run the code on a single node with two different GPU models (GeForce RTX 2080 and Tesla K20m), it aborts with an error. The command I use is: ./sssp 1 -level gpu=2 -ll:gpu 2 -ll:fsize 1200 -ll:zsize 2000 -file ~/zy/LuxData/hollywood.lux -start 0

In the legion/runtime/runtime.mk file, I set GPU_ARCH as follows:

#defaults for CUDA
#GPU_ARCH ?= fermi
GPU_ARCH ?= kepler (Tesla K20m)
#GPU_ARCH ?= k20
#GPU_ARCH ?= pascal
#GPU_ARCH ?= volta
GPU_ARCH ?= turing (GeForce RTX 2080)

The run reports the following:

(base) dcase@dcase-PowerEdge-R730:~/saffron/demo/Lux/sssp$ ./sssp -level gpu=2 -ll:gpu 2 -ll:fsize 1200 -ll:zsize 2000 -file ~/zy/LuxData/hollywood.lux -start 0
[0 - 7f68bdf63780] {2}{gpu}: GPU #0: GeForce RTX 2080 (7.5) 7952 MB
[0 - 7f68bdf63780] {2}{gpu}: GPU #1: Tesla K20m (3.5) 5062 MB
[0 - 7f68bdf63780] {2}{gpu}: registering fat binary 0x2b34460 with GPU 0x8bf7cd0
[0 - 7f68bdf63780] {2}{gpu}: Loaded CUDA Module. JIT Output: 
[0 - 7f68bdf63780] {2}{gpu}: registering fat binary 0x2b34460 with GPU 0x956f930
[0 - 7f68bdf63780] {5}{gpu}: ERROR: The binary was compiled for the wrong GPU architecture. Update the 'GPU_ARCH' flag at the top of runtime/runtime.mk to match your current GPU architecture.
[0 - 7f68bdf63780] {5}{gpu}: Failed to load CUDA module! Error log: 
CU: cuModuleLoadDataEx = 209 (CUDA_ERROR_NO_BINARY_FOR_GPU): no kernel image is available for execution on the device
sssp: /home/dcase/zy/legion/runtime/realm/cuda/cuda_module.cc:2478: CUmod_st* Realm::Cuda::GPU::load_cuda_module(const void*): Assertion `0' failed.
*** Caught a fatal signal: SIGABRT(6) on node 0/1
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 

Running ./deviceQuery from the cuda-10.0/samples/1_Utilities/deviceQuery directory prints this message:

> Peer access from GeForce RTX 2080 (GPU0) -> Tesla K20m (GPU1) : No
> Peer access from Tesla K20m (GPU1) -> GeForce RTX 2080 (GPU0) : No

How can I solve this problem?

streichler commented 5 years ago

Which branch/commit of the Legion tree are you using? It looks like you're using a version of the build system that only compiled CUDA binaries for a single architecture. If you switch to a recent pull of the master branch (commit efc9467 was pushed two days ago), you can set GPU_ARCH to either auto (the default - builds for all architectures supported by nvcc) or to 35,75 (the two architectures you need for the GPUs listed above), and that'll hopefully fix the problem you're seeing.
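
The 35,75 value matches the compute capabilities printed in the startup log earlier in this thread (7.5 for the GeForce RTX 2080 and 3.5 for the Tesla K20m). As a rough illustration, the sketch below derives that comma-separated list from log lines of the same shape. The helper name gpu_arch_from_log is hypothetical and not part of Legion, and the regular expression assumes the log format shown in this issue.

```python
import re

# Assumed log shape (taken from this thread):
#   "[0 - ...] {2}{gpu}: GPU #0: GeForce RTX 2080 (7.5) 7952 MB"
GPU_LINE = re.compile(r"GPU #\d+: .*\((\d+)\.(\d+)\)")


def gpu_arch_from_log(log_text):
    """Hypothetical helper: build a GPU_ARCH-style list such as "35,75"."""
    archs = set()
    for line in log_text.splitlines():
        match = GPU_LINE.search(line)
        if match:
            major, minor = int(match.group(1)), int(match.group(2))
            # Compute capability 7.5 becomes architecture number 75.
            archs.add(major * 10 + minor)
    # Sort numerically so the result is stable regardless of GPU order.
    return ",".join(str(arch) for arch in sorted(archs))


log = """\
[0 - 7f68bdf63780] {2}{gpu}: GPU #0: GeForce RTX 2080 (7.5) 7952 MB
[0 - 7f68bdf63780] {2}{gpu}: GPU #1: Tesla K20m (3.5) 5062 MB
"""
print(gpu_arch_from_log(log))  # prints "35,75"
```

The resulting string is the kind of value GPU_ARCH would take in place of auto when only those two architectures are needed.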

Clinuxnewbie commented 5 years ago

Thanks for your help. Following your suggestion solved my problem.