LuxGraph / Lux

A Distributed Multi-GPU System for Fast Graph Processing
Apache License 2.0

A question about GASNet #5

Open JsonZhangAA opened 5 years ago

JsonZhangAA commented 5 years ago

Hi, I have a question about GASNet. With GASNet installed, I run Lux like this: ./pagerank 1 -ll:gpu 2 -ll:fsize 6000 -ll:zsize 5000 -file ~/zy/hollywood.lux -ni 1. I also set the environment variable SSH_SERVERS, but I get the following error:

*** GASNET WARNING: int AMUDP_SPMDStartup_AMUDP_NDEBUG(int*, char***, int, int, amudp_spawnfn_t, uint64_t*, amudp_eb**, amudp_ep**) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
  from function AMUDP_SPMDStartup
  at ./amudp_spmd.cpp:961
  reason: slave failed DNSLookup on master host name
GASNet initialization encountered an error: "slave AMUDP_SPMDStartup() failed"
  in gasnetc_init at /home/aim/zhangyang_workplace/GASNet-1.32.0/udp-conduit/gasnet_core.c:242
GASNet gasnetc_init returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zhangyang_workplace/GASNet-1.32.0/udp-conduit/gasnet_core.c:306
GASNet gasnet_init_GASNET_1320PARnopshmFASTnodebugnotracenostatsnodebugmallocnosrclines returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zhangyang_workplace/GASNet-1.32.0/udp-conduit/gasnet_core.c:312
GASNET: gasnet_init(argc, argv) = 3 (GASNET_ERR_RESOURCE, Problem with requested resource) 

Sorry, my English is not very good.

jiazhihao commented 5 years ago

Hi,

It seems you are running Lux on a single machine, in which case you don't need to build with GASNet enabled. In the case you would like to do distributed runs, you can use the following command line to start gasnet (wrap pagerank with mpirun)

mpirun -n NM ./pagerank -ni 10 -file /cstor/stanford/aaiken/users/zhihao/LuxGraphs/hollywood.lux  -ll:gpu 2 -ll:cpu 2 -ll:fsize 5000 -ll:zsize 5000

where NM is the number of machines. You can also replace mpirun with srun or gasnetrun.
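As a rough sketch of the two cases described above, the helper below prints the launch line for a given machine count. The function name lux_launch_cmd and the path graph.lux are hypothetical placeholders; the flags are the ones quoted in this thread, and which launcher actually applies depends on the GASNet conduit that was built.

```shell
# Hypothetical helper: print the launch line for NM machines.
# graph.lux is a placeholder path; the flags are copied from this thread.
lux_launch_cmd() {
  lux_args="-ni 10 -file graph.lux -ll:gpu 2 -ll:cpu 2 -ll:fsize 5000 -ll:zsize 5000"
  if [ "$1" -le 1 ]; then
    # Single machine: a build without GASNet needs no launcher.
    echo "./pagerank $lux_args"
  else
    # Distributed run: wrap the binary with the launcher, one process per machine.
    echo "mpirun -n $1 ./pagerank $lux_args"
  fi
}

lux_launch_cmd 1
lux_launch_cmd 2
```

The function only prints the command, so it can be inspected before anything is launched.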

JsonZhangAA commented 5 years ago

Hi,

I have built a distributed version of Lux like this:

make clean
USE_GASNET=1 make -j 4

And I run Lux like this:

mpirun -n 2 ./pagerank  -ni 1 -file ~/zy/hollywood.lux -ll:gpu 2 -ll:cpu 2 -ll:fsize 5000 -ll:zsize 5000
GASNet: Invalid number of nodes: -ni
GASNet: Usage './pagerank <num_nodes> {program arguments}'
GASNet: Invalid number of nodes: -ni
GASNet: Usage './pagerank <num_nodes> {program arguments}'
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

jiazhihao commented 5 years ago

Hi,

I just realized that you are using the GASNet UDP conduit, which is launched slightly differently from the IBV conduit. Can you install another version of GASNet, available at https://github.com/StanfordLegion/gasnet? It includes all the configuration parameters for building a correct version of GASNet. If you still encounter the issue from your first message, I will redirect this to the Legion/GASNet team to figure it out.

JsonZhangAA commented 5 years ago

Hi,

I installed GASNet from https://github.com/StanfordLegion/gasnet and ran it:

aim@aim-PowerEdge-R730:~/zy/Lux/pagerank$ mpirun -n 2 ./pagerank -ni 1 -file ~/zy/hollywood.lux -ll:gpu 1 -ll:cpu 1 -ll:fsize 5000 -ll:zsize 5000

GASNet gasnetc_init returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zy/gasnet-master/GASNet-1.30.0/ibv-conduit/gasnet_core.c:1625
  reason: unable to open any HCA ports
GASNet gasnet_init_GASNET_1300PARnopshmFASTnodebugnotracenostatsnodebugmallocnosrclines returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zy/gasnet-master/GASNet-1.30.0/ibv-conduit/gasnet_core.c:1911
GASNET: gasnet_init(argc, argv) = 10002 (GASNET_ERR_RESOURCE, Problem with requested resource)
GASNet gasnetc_init returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zy/gasnet-master/GASNet-1.30.0/ibv-conduit/gasnet_core.c:1625
  reason: unable to open any HCA ports
GASNet gasnet_init_GASNET_1300PARnopshmFASTnodebugnotracenostatsnodebugmallocnosrclines returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zy/gasnet-master/GASNet-1.30.0/ibv-conduit/gasnet_core.c:1911
GASNET: gasnet_init(argc, argv) = 10002 (GASNET_ERR_RESOURCE, Problem with requested resource)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 59579 on
node aim-PowerEdge-R730 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

I think the reason may be that my computer doesn't have InfiniBand. Can I run Lux with MPI?

elliottslaughter commented 5 years ago

Can you try:

git clone https://github.com/StanfordLegion/gasnet.git
cd gasnet
make CONDUIT=mpi

And then when you build Lux:

make clean
USE_GASNET=1 CONDUIT=mpi make -j4

Please note: the mpi conduit of GASNet is not high-performance and should be used for testing only. If you want high performance, we'd generally recommend a high-performance network (InfiniBand, Aries, etc.) or, if you are forced to use Ethernet, the udp conduit of GASNet.

elliottslaughter commented 5 years ago

If you do need high performance on an Ethernet network, here would be my recommendation:

git clone https://github.com/StanfordLegion/gasnet.git
cd gasnet
make CONDUIT=udp

And then when you build Lux:

make clean
USE_GASNET=1 CONDUIT=udp make -j4

And then to run do something like:

SSH_SERVERS=localhost,localhost ../../gasnet/release/bin/amudprun -n 2 ./pagerank ...

Note the use of amudprun instead of mpirun.
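A small sketch of how the launch line above fits together: SSH_SERVERS is a comma-separated host list, and in the commands in this thread -n equals the number of hosts listed. The helper names join_hosts and amudprun_cmd are hypothetical, and the amudprun path is the one from the command above.

```shell
# Hypothetical helpers that assemble the amudprun launch line from a host list.
join_hosts() (
  # The subshell body keeps the IFS change local; "$*" joins arguments with the first IFS character.
  IFS=,
  echo "$*"
)

amudprun_cmd() {
  # One process per listed host, matching the commands shown in this thread.
  echo "SSH_SERVERS=$(join_hosts "$@") ../../gasnet/release/bin/amudprun -n $# ./pagerank"
}

amudprun_cmd localhost localhost
```

The program arguments (-ni, -file, -ll:...) would follow ./pagerank exactly as in a single-machine run.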

JsonZhangAA commented 5 years ago

Hi, thanks for your reply. I used the udp conduit as you described, but it shows the following error.

aim@aim-PowerEdge-R730:~/zy/Lux/pagerank$ SSH_SERVERS=localhost,P4000 ~/zy/gasnet-master/release/bin/amudprun -n 2 ./pagerank  -ni 1 -file ~/zy/hollywood.lux -ll:gpu 1  -ll:fsize 5000 -ll:zsize 5000
aim@p4000's password: 
AMUDP int AMUDP_SPMDStartup_AMUDP_NDEBUG(int*, char***, int, int, amudp_spawnfn_t, uint64_t*, amudp_eb**, amudp_ep**) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
  from function AMUDP_SPMDStartup
  at /home/aim/zy/gasnet/GASNet-1.30.0/other/amudp/amudp_spmd.cpp:988
  reason: slave failed DNSLookup on master host name
GASNet initialization encountered an error: "slave AMUDP_SPMDStartup() failed"
  in gasnetc_init at /home/aim/zy/gasnet/GASNet-1.30.0/udp-conduit/gasnet_core.c:242
GASNet gasnetc_init returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zy/gasnet/GASNet-1.30.0/udp-conduit/gasnet_core.c:306
GASNet gasnet_init_GASNET_1300PARnopshmFASTnodebugnotracenostatsnodebugmallocnosrclines returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
  at /home/aim/zy/gasnet/GASNet-1.30.0/udp-conduit/gasnet_core.c:312
GASNET: gasnet_init(argc, argv) = 3 (GASNET_ERR_RESOURCE, Problem with requested resource)

P4000 is another machine. What is the reason for this? I have no idea.

elliottslaughter commented 5 years ago

The udp conduit in GASNet needs to use SSH to connect to the other machines (assuming you don't have some sort of job launcher installed on the machine). I believe in order for this to work you need to be able to SSH without a password. The easiest way to do this would be to create a passwordless SSH key, and copy the public key into ~/.ssh/authorized_keys.

So you should be able to manually run ssh p4000 and log in without a password prompt. If this doesn't work you'll have to fix this before trying GASNet again.
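One way to set this up and check it, as a sketch that assumes OpenSSH on both machines. The function name check_passwordless_ssh and the address user@p4000 are placeholders; the one-time setup commands are left as comments because they modify ~/.ssh.

```shell
# One-time setup on the launching machine (run by hand; user@p4000 is a placeholder):
#   ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519   # key pair with an empty passphrase
#   ssh-copy-id -i ~/.ssh/id_ed25519.pub user@p4000    # appends the public key to ~/.ssh/authorized_keys on p4000

# Hypothetical check: succeeds only if ssh can log in without prompting.
check_passwordless_ssh() {
  # BatchMode=yes makes ssh fail instead of asking for a password.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true; then
    echo "ok: $1"
  else
    echo "needs setup: $1"
    return 1
  fi
}
```

Calling check_passwordless_ssh p4000 before launching amudprun should show whether the password prompt seen in the log above would still appear.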

elliottslaughter commented 5 years ago

The other thing I see in your error trace is that it seems like DNS lookup is failing:

reason: slave failed DNSLookup on master host name

I don't know how your machine is configured, but if DNS isn't set up properly, you may need to use the direct IP address of the machine instead of its hostname.

Edit: and this probably goes for both machines (remember that the other machine will have to establish a connection back to your current machine as well, so localhost doesn't provide enough information to do this).
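A quick way to test name resolution on each machine, sketched under the assumption that getent is available (it is part of glibc on most Linux systems). The function name check_resolves is hypothetical.

```shell
# Hypothetical check: print the first address a host name resolves to, or report failure.
# getent consults the same sources as the system resolver, including /etc/hosts.
check_resolves() {
  addr=$(getent hosts "$1" | awk '{print $1; exit}')
  if [ -n "$addr" ]; then
    echo "$1 -> $addr"
  else
    echo "$1 does not resolve"
    return 1
  fi
}
```

Running check_resolves on every node, for every other node's host name, covers both directions of the connection. If a name does not resolve, listing IP addresses in SSH_SERVERS, as suggested above, avoids the lookup.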

JsonZhangAA commented 5 years ago

Hi, I solved the DNS problem, and I can now SSH without a password. But I have a question. From Titan (node 0), I can run Lux on P4000 (node 1) alone, and I can also run it on Titan alone. But I can't run Lux on both nodes together.

aim@aim-PowerEdge-R730:~/zy/Lux/pagerank$ SSH_SERVERS=Titan ../../gasnet-master/release/bin/amudprun -n 1 ./pagerank  -ni 1 -file ~/zy/hollywood.lux -ll:gpu 1  -ll:fsize 5000 -ll:zsize 5000
[0 - 7f3af6f8d700] {3}{pagerank}: PageRank settings: numPartitions(1) numIter(1) filename = /home/aim/zy/hollywood.lux
[0 - 7f3af6f8d700] {3}{graph}: Load graph: numNodes(1139905) numEdges(57515616)
[0 - 7f3af6f8d700] {3}{graph}: left_bound = 0 right_bound = 1139904
[Memory Setting] Set ll:fsize >= 470MB and ll:zsize >= 242MB
[0 - 7f3af6f8d700] {3}{graph}: Load task: file(/home/aim/zy/hollywood.lux) rowLeft(0) rowRight(1139904) colLeft(0) colRight(57515615)
[0 - 7f3af6f8d700] {3}{pagerank}: Start PageRank computation...
[0 - 7f3af6f8d700] {3}{pagerank}: Finish PageRank computation...
ELAPSED TIME = 0.0146050 s
aim@aim-PowerEdge-R730:~/zy/Lux/pagerank$ SSH_SERVERS=P4000 ../../gasnet-master/release/bin/amudprun -n 1 ./pagerank  -ni 1 -file ~/zy/hollywood.lux -ll:gpu 1  -ll:fsize 5000 -ll:zsize 5000
slave args: aim-PowerEdge-R730:57339
[0 - 7f63553ce700] {3}{pagerank}: PageRank settings: numPartitions(1) numIter(1) filename = /home/aim/zy/hollywood.lux
[0 - 7f63553ce700] {3}{graph}: Load graph: numNodes(1139905) numEdges(57515616)
[0 - 7f63553ce700] {3}{graph}: left_bound = 0 right_bound = 1139904
[Memory Setting] Set ll:fsize >= 470MB and ll:zsize >= 242MB
[0 - 7f63553ce700] {3}{graph}: Load task: file(/home/aim/zy/hollywood.lux) rowLeft(0) rowRight(1139904) colLeft(0) colRight(57515615)
[0 - 7f63553ce700] {3}{pagerank}: Start PageRank computation...
ELAPSED TIME = 0.0138280 s
[0 - 7f63553ce700] {3}{pagerank}: Finish PageRank computation..

The following run fails:

aim@aim-PowerEdge-R730:~/zy/Lux/pagerank$ SSH_SERVERS=Titan,P4000 ../../gasnet-master/release/bin/amudprun -n 2 ./pagerank  -ni 1 -file ~/zy/hollywood.lux -ll:gpu 1  -ll:fsize 5000 -ll:zsize 5000
slave args: aim-PowerEdge-R730:55492
[0 - 7fa17cdc4700] {3}{pagerank}: PageRank settings: numPartitions(1) numIter(1) filename = /home/aim/zy/hollywood.lux
[0 - 7fa17cdc4700] {3}{graph}: Load graph: numNodes(1139905) numEdges(57515616)
[0 - 7fa17cdc4700] {3}{graph}: left_bound = 0 right_bound = 600856
[0 - 7fa17cdc4700] {3}{graph}: left_bound = 600857 right_bound = 1139904
*** Caught a fatal signal: SIGSEGV(11) on node 1/2
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
aim@aim-PowerEdge-R730:~/zy/Lux/pagerank$ bash: line 1: 21751 Segmentation fault      (core dumped) env 'AMUDP_SLAVE_ARGS=1,aim-PowerEdge-R730:55492,' './pagerank' '-ni' '1' '-file' '/home/aim/zy/hollywood.lux' '-ll:gpu' '1' '-ll:fsize' '5000' '-ll:zsize' '5000'

Do I still need a node to run P4000 and Titan? Thank you for your prompt reply.

I found that the problem happens when the following statement executes:

Rect<1> r = runtime->get_index_space_domain(ctx,col_idx.get_index_space());

elliottslaughter commented 5 years ago

Can you get a backtrace? It's best to build with DEBUG=1 (note: make clean first) and then run with REALM_FREEZE_ON_ERROR=1 in the environment. You should get an error message like Process X has frozen on node N. and then you can attach with gdb -p X to dump a backtrace with thread apply all bt.
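The attach step can also be scripted so the backtrace lands in a file. This is a sketch: the helper names frozen_pid and dump_backtrace are hypothetical, the message format matched here is the one that appears in a later log in this thread ("Process 1167 on node aim is frozen!"), and attaching may require sudo depending on the system's ptrace settings.

```shell
# Hypothetical helper: extract the PID from a freeze message of the form
# "Process <pid> on node <name> is frozen!".
frozen_pid() {
  printf '%s\n' "$1" | sed -n 's/^Process \([0-9][0-9]*\) on node .* is frozen!$/\1/p'
}

# Hypothetical helper: attach gdb non-interactively and save all thread backtraces.
# -batch exits after running the -ex commands, so no pager prompts interrupt the output.
dump_backtrace() {
  gdb -p "$1" -batch -ex "thread apply all bt" > "$2" 2>&1
}

frozen_pid 'Process 1167 on node aim is frozen!'
```

With the PID in hand, dump_backtrace 1167 backtrace.txt on the node named in the message writes a file that can be attached to the issue.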

JsonZhangAA commented 5 years ago

Hi, it prints the following output:

aim@aim:~/zy/LuxOld/pagerank$ SSH_SERVERS=115.157.201.179,115.157.201.181 ~/zy/gasnet/release/bin/amudprun -n 2 ./pagerank  -ni 1 -file ~/zy/LuxData/indochina.lux -ll:gpu 1 -ll:fsize 5000 -ll:zsize 5000
*** GASNET WARNING: Both $GASNET_SSH_SERVERS and $SSH_SERVERS are set, to different values. Using the former.
slave args: 115,157,201,181,137,177
[0 - 7f0ad1104700] {3}{pagerank}: PageRank settings: numPartitions(1) numIter(1) filename = /home/aim/zy/LuxData/indochina.lux
[0 - 7f0ad1104700] {3}{graph}: Load graph: numNodes(7414866) numEdges(194109311)
[0 - 7f0ad1104700] {3}{graph}: left_bound = 0 right_bound = 5322925
[0 - 7f0ad1104700] {3}{graph}: left_bound = 5322926 right_bound = 7414865
*** Caught a fatal signal: SIGSEGV(11) on node 1/2
[1] /usr/bin/gdb -nx -batch -x /tmp/gasnet_at4eoA '/home/aim/zy/LuxOld/pagerank/./pagerank' 2918
[1] No threads.
aim@aim:~/zy/LuxOld/pagerank$ bash: line 1:  2918 Segmentation fault      (core dumped) env 'AMUDP_SLAVE_ARGS=1,115,157,201,181,137,177,' './pagerank' '-ni' '1' '-file' '/home/aim/zy/LuxData/indochina.lux' '-ll:gpu' '1' '-ll:fsize' '5000' '-ll:zsize' '5000'
^C

elliottslaughter commented 5 years ago

It's a bit hard to tell what's going on here since the backtrace appears to have failed, but here's what I'd recommend to debug it:

  1. Rebuild with DEBUG=1 (note: make clean first).
  2. Run with REALM_FREEZE_ON_ERROR=1 and not GASNET_BACKTRACE=1. (Edit: I suspect the latter was set in your run above because it was trying to get a backtrace automatically. It should not have been doing this if only REALM_FREEZE_ON_ERROR=1 was set.)
  3. Wait for message like Process 123 is frozen on node n0000.
  4. SSH to n0000 and gdb -p 123, then thread apply all bt and copy-and-paste that into a file to be attached here.

JsonZhangAA commented 5 years ago

Hi, I followed the method you recommended to run the program. After step 2, I did not receive a message like Process 123 is frozen on node n0000. It shows the following output:

aim@aim:~/zy/LuxOld/pagerank$ REALM_FREEZE_ON_ERROR=1 SSH_SERVERS=115.157.201.181,115.157.201.179 ~/zy/gasnet/release/bin/amudprun -n 2 ./pagerank  -ni 1 -file ~/zy/LuxData/hollywood.lux -ll:gpu 1 -ll:cpu 1 -ll:fsize 5000 -ll:zsize 5000
slave args: 115,157,201,181,187,105
[0 - 7f4d16a96700] {3}{pagerank}: PageRank settings: numPartitions(1) numIter(1) filename = /home/aim/zy/LuxData/hollywood.lux
[0 - 7f4d16a96700] {3}{graph}: Load graph: numNodes(1139905) numEdges(57515616)
[0 - 7f4d16a96700] {3}{graph}: left_bound = 0 right_bound = 600856
[0 - 7f4d16a96700] {3}{graph}: left_bound = 600857 right_bound = 1139904
*** Caught a fatal signal: SIGSEGV(11) on node 1/2
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
aim@aim:~/zy/LuxOld/pagerank$ bash: line 1: 18910 Segmentation fault      (core dumped) env 'AMUDP_SLAVE_ARGS=1,115,157,201,181,187,105,' './pagerank' '-ni' '1' '-file' '/home/aim/zy/LuxData/hollywood.lux' '-ll:gpu' '1' '-ll:cpu' '1' '-ll:fsize' '5000' '-ll:zsize' '5000'

This is the output of DEBUG=1 USE_GASNET=1 CONDUIT=udp make -j4:

aim@aim:~/zy/LuxOld/pagerank$ DEBUG=1 USE_GASNET=1 CONDUIT=udp make -j4
g++ -o pagerank.cc.o -c pagerank.cc  -std=c++11 -DUSE_DISK  -march=native -DUSE_LIBDL -DREALM_USE_OPENMP -DREALM_OPENMP_GOMP_SUPPORT -DREALM_OPENMP_KMP_SUPPORT -DUSE_CUDA -DUSE_GASNET -DGASNETI_BUG1389_WORKAROUND=1 -DGASNET_CONDUIT_UDP -DUSE_ZLIB -DDEBUG_REALM -DDEBUG_LEGION -O0 -ggdb  -DCOMPILE_TIME_MIN_LEVEL=LEVEL_DEBUG -Wall -Wno-strict-overflow -I../cub -I. -I../legion/runtime -I../legion/runtime/mappers -I/usr/local/cuda/include -I../legion/runtime/realm/transfer -I/home/aim/zy/gasnet/release/include -I/home/aim/zy/gasnet/release/include/udp-conduit
g++ -o ../core/lux_mapper.cc.o -c ../core/lux_mapper.cc  -std=c++11 -DUSE_DISK  -march=native -DUSE_LIBDL -DREALM_USE_OPENMP -DREALM_OPENMP_GOMP_SUPPORT -DREALM_OPENMP_KMP_SUPPORT -DUSE_CUDA -DUSE_GASNET -DGASNETI_BUG1389_WORKAROUND=1 -DGASNET_CONDUIT_UDP -DUSE_ZLIB -DDEBUG_REALM -DDEBUG_LEGION -O0 -ggdb  -DCOMPILE_TIME_MIN_LEVEL=LEVEL_DEBUG         -Wall -Wno-strict-overflow -I../cub -I. -I../legion/runtime -I../legion/runtime/mappers -I/usr/local/cuda/include -I../legion/runtime/realm/transfer -I/home/aim/zy/gasnet/release/include -I/home/aim/zy/gasnet/release/include/udp-conduit
/usr/local/cuda/bin/nvcc -o pagerank_gpu.cu.o -c pagerank_gpu.cu  -std=c++11 -DUSE_CUDA -DDEBUG_REALM -DDEBUG_LEGION -g -O0 -arch=compute_52 -code=sm_52 -DMAXWELL_ARCH -Xptxas "-v"  -I../cub -I. -I../legion/runtime -I../legion/runtime/mappers -I/usr/local/cuda/include -I../legion/runtime/realm/transfer -I/home/aim/zy/gasnet/release/include -I/home/aim/zy/gasnet/release/include/udp-conduit
rm -f liblegion.a
ar rc liblegion.a ../legion/runtime/legion/legion.cc.o ../legion/runtime/legion/legion_c.cc.o ../legion/runtime/legion/legion_ops.cc.o ../legion/runtime/legion/legion_tasks.cc.o ../legion/runtime/legion/legion_context.cc.o ../legion/runtime/legion/legion_trace.cc.o ../legion/runtime/legion/legion_spy.cc.o ../legion/runtime/legion/legion_profiling.cc.o ../legion/runtime/legion/legion_profiling_serializer.cc.o ../legion/runtime/legion/legion_instances.cc.o ../legion/runtime/legion/legion_views.cc.o ../legion/runtime/legion/legion_analysis.cc.o ../legion/runtime/legion/legion_constraint.cc.o ../legion/runtime/legion/legion_mapping.cc.o ../legion/runtime/legion/region_tree.cc.o ../legion/runtime/legion/runtime.cc.o ../legion/runtime/legion/garbage_collection.cc.o ../legion/runtime/legion/mapper_manager.cc.o ../legion/runtime/mappers/default_mapper.cc.o ../legion/runtime/mappers/mapping_utilities.cc.o ../legion/runtime/mappers/shim_mapper.cc.o ../legion/runtime/mappers/test_mapper.cc.o ../legion/runtime/mappers/replay_mapper.cc.o ../legion/runtime/mappers/debug_mapper.cc.o ../legion/runtime/mappers/wrapper_mapper.cc.o
rm -f librealm.a
ar rc librealm.a ../legion/runtime/realm/runtime_impl.cc.o ../legion/runtime/realm/transfer/transfer.cc.o ../legion/runtime/realm/transfer/channel.cc.o ../legion/runtime/realm/transfer/channel_disk.cc.o ../legion/runtime/realm/transfer/lowlevel_dma.cc.o ../legion/runtime/realm/module.cc.o ../legion/runtime/realm/threads.cc.o ../legion/runtime/realm/faults.cc.o ../legion/runtime/realm/operation.cc.o ../legion/runtime/realm/tasks.cc.o ../legion/runtime/realm/metadata.cc.o ../legion/runtime/realm/deppart/partitions.cc.o ../legion/runtime/realm/deppart/sparsity_impl.cc.o ../legion/runtime/realm/deppart/image.cc.o ../legion/runtime/realm/deppart/preimage.cc.o ../legion/runtime/realm/deppart/byfield.cc.o ../legion/runtime/realm/deppart/setops.cc.o ../legion/runtime/realm/event_impl.cc.o ../legion/runtime/realm/rsrv_impl.cc.o ../legion/runtime/realm/proc_impl.cc.o ../legion/runtime/realm/mem_impl.cc.o ../legion/runtime/realm/inst_impl.cc.o ../legion/runtime/realm/inst_layout.cc.o ../legion/runtime/realm/machine_impl.cc.o ../legion/runtime/realm/sampling_impl.cc.o ../legion/runtime/realm/transfer/lowlevel_disk.cc.o ../legion/runtime/realm/numa/numa_module.cc.o ../legion/runtime/realm/numa/numasysif.cc.o ../legion/runtime/realm/openmp/openmp_module.cc.o ../legion/runtime/realm/openmp/openmp_threadpool.cc.o ../legion/runtime/realm/openmp/openmp_api.cc.o ../legion/runtime/realm/procset/procset_module.cc.o ../legion/runtime/realm/cuda/cuda_module.cc.o ../legion/runtime/realm/cuda/cudart_hijack.cc.o ../legion/runtime/realm/activemsg.cc.o ../legion/runtime/realm/logging.cc.o ../legion/runtime/realm/cmdline.cc.o ../legion/runtime/realm/profiling.cc.o ../legion/runtime/realm/codedesc.cc.o ../legion/runtime/realm/timers.cc.o
ptxas info    : 1 bytes gmem
ptxas info    : Compiling entry function '_ZN3cub11EmptyKernelIvEEvv' for 'sm_52'
ptxas info    : Function properties for _ZN3cub11EmptyKernelIvEEvv
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 2 registers, 320 bytes cmem[0]
ptxas info    : Compiling entry function '_Z11init_kerneljjmP10NodeStructP10EdgeStructPKmPKjS6_' for 'sm_52'
ptxas info    : Function properties for _Z11init_kerneljjmP10NodeStructP10EdgeStructPKmPKjS6_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function '_Z9pr_kerneljjmfPK10NodeStructPK10EdgeStructPfS5_' for 'sm_52'
ptxas info    : Function properties for _Z9pr_kerneljjmfPK10NodeStructPK10EdgeStructPfS5_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 4392 bytes smem, 376 bytes cmem[0], 12 bytes cmem[2]
ptxas info    : Compiling entry function '_Z11load_kerneljPKjPfPKf' for 'sm_52'
ptxas info    : Function properties for _Z11load_kerneljPKjPfPKf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 352 bytes cmem[0]
---> Linking objects into one binary: pagerank
g++ -o pagerank pagerank.cc.o ../core/lux_mapper.cc.o pagerank_gpu.cu.o  -L. -llegion -lrealm -lrt -lpthread -ldl -rdynamic -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs -lcuda  -Xlinker -rpath=/usr/local/cuda/lib64 -L/home/aim/zy/gasnet/release/lib -lrt -lm -lgasnet-udp-par -lamudp -lz

jiazhihao commented 5 years ago

Hi, I tried to reproduce your execution on our cluster with the IBV conduit and it worked fine. This suggests that the segmentation fault you saw is not related to the application code. We still need a backtrace to figure out whether the segfault is in the GASNet layer or the Legion runtime.

JsonZhangAA commented 5 years ago

Hi, I am looking into the cause of the problem again, and I will let you know what I find. Thank you for your prompt reply.

elliottslaughter commented 5 years ago

@JsonZhangAA Can you try this command instead?

SSH_SERVERS=115.157.201.181,115.157.201.179 ~/zy/gasnet/release/bin/amudprun -n 2 bash -c "REALM_FREEZE_ON_ERROR=1 ./pagerank -ni 1 -file ~/zy/LuxData/hollywood.lux -ll:gpu 1 -ll:cpu 1 -ll:fsize 5000 -ll:zsize 5000"

The problem is amudprun is using ssh to launch the command, and ssh doesn't pass environment variables to the new process.
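The effect of the bash -c wrapper can be seen locally without any remote machine. This is a sketch: the helper name wrap_with_env is hypothetical, and the local demonstration only shows that an assignment written inside the command string reaches the child process even when the calling shell does not have the variable set.

```shell
# Hypothetical helper: build a bash -c wrapper that sets VAR=VALUE inside the launched shell.
wrap_with_env() {
  assignment="$1"
  shift
  printf 'bash -c "%s %s"\n' "$assignment" "$*"
}

wrap_with_env REALM_FREEZE_ON_ERROR=1 ./pagerank -ni 1

# Local demonstration: the variable is unset in this shell, yet the child sees it,
# because the assignment travels inside the command string itself.
unset REALM_FREEZE_ON_ERROR
bash -c 'REALM_FREEZE_ON_ERROR=1 printenv REALM_FREEZE_ON_ERROR'   # prints 1
```

Since the assignment is part of the command text that the launcher forwards, it does not depend on the launcher or ssh passing the caller's environment along.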

JsonZhangAA commented 5 years ago

Hi @elliottslaughter, your method does work.

aim@aim:~/zy/LuxNew/pagerank$ SSH_SERVERS=115.157.201.181,115.157.201.179 ~/zy/gasnet/release/bin/amudprun -n 2 bash -c "REALM_FREEZE_ON_ERROR=1  ./pagerank -ni 1 -file ~/zy/LuxData/hollywood.lux -ll:gpu 1 -ll:fsize 5000 -ll:zsize 5000"
slave args: 115,157,201,181,202,41
*** FATAL ERROR(Node 0): An active message was returned to sender,
    and trapped by the default returned message handler (handler 0):
Error Code: ECONGESTION: Congestion at destination endpoint
Message type: AM_REQUEST_XFER_M
Destination: (115.157.201.179:54900) (1)
Handler: 141
Tag: 0x739dc9b50001043d
Arguments(8): 0x00000000  0x00000001  0x6f527400  0x00007f20  0x00000000  0x00000003  0x00000004  0x00000002
Aborting...
Legion process received signal 6: Aborted
Process 1167 on node aim is frozen!

This is the output of gdb -p 1167:

aim@aim:~$ sudo gdb -p 1167
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1167
[New LWP 1168]
[New LWP 1169]
[New LWP 1170]
[New LWP 1171]
[New LWP 1172]
[New LWP 1173]
[New LWP 1174]
[New LWP 1175]
[New LWP 1176]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f207016130d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread apply all bt

Thread 10 (Thread 0x7f2043fff700 (LWP 1176)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000bdbd49 in Realm::DMAThread::dma_thread_loop() ()
#2  0x0000000000bfe8d8 in Realm::KernelThread::pthread_entry(void*) ()
#3  0x00007f2071c9a6ba in start_thread (arg=0x7f2043fff700) at pthread_create.c:333
#4  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 9 (Thread 0x7f2048921700 (LWP 1175)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000109d43f in Realm::Cuda::GPUWorker::process_streams(bool) ()
#2  0x000000000109d6ad in Realm::Cuda::GPUWorker::thread_main() ()
#3  0x0000000000bfe8d8 in Realm::KernelThread::pthread_entry(void*) ()
#4  0x00007f2071c9a6ba in start_thread (arg=0x7f2048921700) at pthread_create.c:333
#5  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 8 (Thread 0x7f2049122700 (LWP 1174)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000c1daa6 in Realm::PartitioningOpQueue::worker_thread_loop() ()
#2  0x0000000000bfe8d8 in Realm::KernelThread::pthread_entry(void*) ()
#3  0x00007f2071c9a6ba in start_thread (arg=0x7f2049122700) at pthread_create.c:333
#4  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7f2049923700 (LWP 1173)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000be83e8 in Realm::DmaRequestQueue::dequeue_request(bool) ()
#2  0x0000000000bed610 in Realm::DmaRequestQueue::worker_thread_loop() ()
#3  0x0000000000bfe8d8 in Realm::KernelThread::pthread_entry(void*) ()
#4  0x00007f2071c9a6ba in start_thread (arg=0x7f2049923700) at pthread_create.c:333
#5  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7f2049b24700 (LWP 1172)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00000000010a8912 in IncomingMessageManager::get_messages(int&, bool) ()
#2  0x00000000010a8a12 in IncomingMessageManager::handler_thread_loop() ()
#3  0x0000000000bfe8d8 in Realm::KernelThread::pthread_entry(void*) ()
#4  0x00007f2071c9a6ba in start_thread (arg=0x7f2049b24700) at pthread_create.c:333
#5  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7f204a325700 (LWP 1171)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2071c9cefe in __GI___pthread_mutex_lock (mutex=0x1663ba0 <gasnetc_AMlock>)
    at ../nptl/pthread_mutex_lock.c:135
#2  0x00000000010c3e10 in gasnetc_AMPoll ()
#3  0x00000000010aa15a in EndpointManager::polling_worker_loop() ()
#4  0x0000000000bfe8d8 in Realm::KernelThread::pthread_entry(void*) ()
---Type <return> to continue, or q <return> to quit---
#5  0x00007f2071c9a6ba in start_thread (arg=0x7f204a325700) at pthread_create.c:333
#6  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7f204ab26700 (LWP 1170)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f2070e070bd in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f2070dbde74 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f2070e06468 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f2071c9a6ba in start_thread (arg=0x7f204ab26700) at pthread_create.c:333
#5  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7f204b327700 (LWP 1169)):
#0  0x00007f207019074d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f2070e049a3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f2070e6ce8d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f2070e06468 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f2071c9a6ba in start_thread (arg=0x7f204b327700) at pthread_create.c:333
#5  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7f1c93364700 (LWP 1168)):
#0  0x00007f207019d8c8 in accept4 (fd=12, addr=..., addr_len=0x7f1c93363d98, flags=524288)
    at ../sysdeps/unix/sysv/linux/accept4.c:40
#1  0x00007f2070e058e6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f2070df8c6d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f2070e06468 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f2071c9a6ba in start_thread (arg=0x7f1c93364700) at pthread_create.c:333
#5  0x00007f207019c41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7f20722b97c0 (LWP 1167)):
#0  0x00007f207016130d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f207016125a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x0000000000ba66ba in Realm::realm_freeze(int) ()
#3  <signal handler called>
#4  0x00007f20700ca428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#5  0x00007f20700cc02a in __GI_abort () at abort.c:89
#6  0x0000000001128558 in AMUDP_FatalErr ()
#7  0x000000000113225c in AMUDP_DefaultReturnedMsg_Handler ()
#8  0x000000000112e53f in AMUDP_HandleRequestTimeouts(amudp_ep*, int) ()
#9  0x0000000001130ab3 in AM_Poll ()
#10 0x00000000010c3e1f in gasnetc_AMPoll ()
#11 0x00000000010a9a0f in do_some_polling() ()
#12 0x000000000107760d in Realm::NodeAnnounceMessage::await_all_announcements() ()
#13 0x0000000000bb4292 in Realm::RuntimeImpl::configure_from_command_line(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) ()
#14 0x0000000000bb5215 in Realm::Runtime::configure_from_command_line(int, char**) ()
---Type <return> to continue, or q <return> to quit---
#15 0x0000000000b16573 in Legion::Internal::Runtime::initialize(int*, char***) ()
#16 0x0000000000b52298 in Legion::Internal::Runtime::start(int, char**, bool) ()
#17 0x00000000007a9316 in main ()
(gdb)

JsonZhangAA commented 5 years ago

Hi, Is there any requirement for the specific model of the Infiniband NIC? What type of network card are you using? I want to reproduce your execution on our cluster with IBV conduit.

elliottslaughter commented 5 years ago

I'm not aware of any requirements on the InfiniBand NIC. We've run Legion on a number of supercomputers that have InfiniBand:

jiazhihao commented 5 years ago

More specifically, all experiments in the VLDB paper were performed on xstream, whose specs are available at http://xstream.stanford.edu/specs/.

JsonZhangAA commented 5 years ago

Hi, I solved the problem with the udp conduit. Thanks. @ElliottSlaughter @jiazhihao