Open bohara opened 12 years ago
I think this is the bug Marwan should fix soonish anyways, where the intersection of 'affinity GPUs' and 'allocated GPUs' is 0.
Marwan - please verify this bug and take appropriate action.
Ok,
On Fri, May 17, 2013 at 12:46 PM, Stefan Eilemann wrote:
> Marwan - please verify this bug and take appropriate action.
@eile: Do you know where I can find Bidur's Equalizer configs?
Autoconf should take care of this. You simply need an allocation where the intersection of the allocated CPUs and the GPU affinity is empty.
Well, I can't reproduce the bug.
I used the same allocation command:
salloc -N1 -n1 --gres=gpu:3 -p interactive
Then I started the X server for the 3 GPUs:
srun -n1 --gres=gpu:3 -w bbplxviz07 --startx --pty /bin/bash
Then I connected to the allocated node with vglconnect:
vglconnect bbplxvizXX
Then I ran eqPly. It doesn't give any segmentation faults.
26163 PipeDraw c/Equalizer/eq/client/pipe.cpp:199 45 Entered pipe thread
26163 PipeDraw c/Equalizer/eq/client/pipe.cpp:310 45 Set up pipe message pump for GLX
26163 PipeDraw c/Equalizer/eq/client/pipe.cpp:338 45 Get Automatic Affinity for 0
26163 PipeDraw c/Equalizer/eq/client/pipe.cpp:344 45 port, device = 4294967295,4294967295
26163 PipeDraw c/Equalizer/eq/client/pipe.cpp:349 45 port, device = 4294967295,4294967295 No Affinity
26163 PipeDraw eq/client/glx/windowSystem.cpp:54 45 Using glx::Pipe
26163 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:199 46 Entered pipe thread
26163 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:310 46 Set up pipe message pump for GLX
26163 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:338 46 Get Automatic Affinity for 1
26163 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:344 46 port, device = 0,1
26163 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:199 57 Entered pipe thread
26163 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:310 57 Set up pipe message pump for GLX
26163 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:338 57 Get Automatic Affinity for 2
26163 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:344 57 port, device = 0,2
26163 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:402 92 For [port, device] = 0,1 : GPU is found.
26163 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:402 112 For [port, device] = 0,2 : GPU is found.
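A note on reading the log: the value 4294967295 printed for port and device of pipe 0 is the largest 32-bit unsigned integer (2^32 - 1), which is commonly used as an "undefined" marker. That this is what it means here is an assumption, though it is consistent with the "No Affinity" message on the same pipe. The sketch below only illustrates the pattern; `kUndefined`, `GpuLocation`, `isSpecified` and `deviceIndex` are hypothetical names and not part of the Equalizer or Lunchbox API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical "undefined" marker: the largest 32-bit unsigned value.
constexpr std::uint32_t kUndefined = std::numeric_limits<std::uint32_t>::max();
static_assert(kUndefined == 4294967295u,
              "marker equals the value printed in the log");

// Hypothetical stand-in for the [port, device] pair shown in the log.
struct GpuLocation
{
    std::uint32_t port;
    std::uint32_t device;
};

// True only if both port and device carry a real value.
bool isSpecified(const GpuLocation& gpu)
{
    return gpu.port != kUndefined && gpu.device != kUndefined;
}

// Returns the device index; callers are expected to check isSpecified() first.
std::uint32_t deviceIndex(const GpuLocation& gpu)
{
    assert(isSpecified(gpu));
    return gpu.device;
}
```

Under this reading, pipe 0 has no explicit GPU location, so no affinity can be derived for it, while pipes 1 and 2 (`0,1` and `0,2`) do resolve to a GPU.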
Connecting to the node and running eqPly through vglrun works fine the first time. However, rerunning it afterwards ends with "Aborted". I get the following output:
26855 PipeDraw c/Equalizer/eq/client/pipe.cpp:338 23 Get Automatic Affinity for 0
26855 PipeDraw c/Equalizer/eq/client/pipe.cpp:344 23 port, device = 4294967295,4294967295
26855 PipeDraw c/Equalizer/eq/client/pipe.cpp:349 23 port, device = 4294967295,4294967295 No Affinity
26855 PipeDraw eq/client/glx/windowSystem.cpp:54 23 Using glx::Pipe
26855 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:199 24 Entered pipe thread
26855 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:310 24 Set up pipe message pump for GLX
26855 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:338 24 Get Automatic Affinity for 1
26855 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:344 24 port, device = 0,1
26855 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:199 31 Entered pipe thread
26855 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:310 31 Set up pipe message pump for GLX
26855 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:338 31 Get Automatic Affinity for 2
26855 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:344 31 port, device = 0,2
26855 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:402 65 For [port, device] = 0,1 : GPU is found.
26855 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:402 79 For [port, device] = 0,2 : GPU is found.
26855 PipeDraw1 eq/client/glx/windowSystem.cpp:54 95 Using glx::Pipe
26855 PipeDraw2 eq/client/glx/windowSystem.cpp:54 99 Using glx::Pipe
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
after 41 requests (41 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
after 41 requests (41 known processed) with 0 events remaining.
26855 PipeDraw2 ox/lunchbox/pluginRegistry.cpp:99 111 Assert: plugins.empty() [Plugin registry not de-initialized] , in:
lunchbox::abort()
lunchbox::detail::PluginRegistry::~PluginRegistry()
lunchbox::PluginRegistry::~PluginRegistry()
/lib64/libc.so.6(exit+0xe2) [0x2acce4401da2]
_XDefaultIOError
_XIOError
_XReply
/usr/lib64/nvidia/libGL.so.1(+0xb7f39) [0x2accaf299f39]
26855 PipeDraw2 rc/Lunchbox/lunchbox/debug.cpp:44 111
Aborted (core dumped)
This bug does not reproduce consistently.
This is the actual error:
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
after 41 requests (41 known processed) with 0 events remaining.
To debug, put Xlib into synchronous mode and run in gdb to see which X call is causing this. It might be caused by the affinity code.
Disabling the automatic affinity by forcing Pipe::_getAutoAffinity() to return lunchbox::Thread::NONE gives the same inconsistent bug.
11257 Equalizer/eq/server/server.cpp:196 3073
11257 Main alizer/eq/client/cvTracker.cpp:43 16 Did not find OpenCV camera 0
11257 PipeDraw c/Equalizer/eq/client/pipe.cpp:200 20 Entered pipe thread
11257 PipeDraw c/Equalizer/eq/client/pipe.cpp:311 20 Set up pipe message pump for GLX
11257 PipeDraw eq/client/glx/windowSystem.cpp:54 20 Using glx::Pipe
11257 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:200 21 Entered pipe thread
11257 PipeDraw1 c/Equalizer/eq/client/pipe.cpp:311 21 Set up pipe message pump for GLX
11257 PipeDraw1 eq/client/glx/windowSystem.cpp:54 21 Using glx::Pipe
11257 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:200 22 Entered pipe thread
11257 PipeDraw2 c/Equalizer/eq/client/pipe.cpp:311 22 Set up pipe message pump for GLX
11257 PipeDraw2 eq/client/glx/windowSystem.cpp:54 22 Using glx::Pipe
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
after 27 requests (27 known processed) with 0 events remaining.
11257 PipeDraw2 ox/lunchbox/pluginRegistry.cpp:99 38 Assert: plugins.empty() [Plugin registry not de-initialized] , in:
lunchbox::abort()
lunchbox::detail::PluginRegistry::~PluginRegistry()
lunchbox::PluginRegistry::~PluginRegistry()
/lib64/libc.so.6(exit+0xe2) [0x2b7a2df34da2]
_XDefaultIOError
_XIOError
_XReply
/usr/lib64/nvidia/libGL.so.1(+0xb7f39) [0x2b79f8dccf39]
11257 PipeDraw2 rc/Lunchbox/lunchbox/debug.cpp:44 38
Aborted (core dumped)
To reproduce the bug, use the allocation described before:
salloc -N1 -n1 --gres=gpu:3 -p interactive
Then use srun to run the X server:
srun -n1 --gres=gpu:3 -w NODE --startx --pty /bin/bash
Then vglconnect to the node:
vglconnect NODE
Then run eqPly.
On a machine with two CPUs (nodes) and three GPUs, where two of the GPUs are connected to one CPU and the third GPU is connected to the other, allocating all three GPUs but only one CPU seems to leave some GPU-CPU connections unresolvable, and this causes a segmentation fault.
You can replicate the error by allocating the cluster resources (using Slurm) with the command below:
salloc -N1 -n1 --gres=gpu:3 -p interactive -t 2:00:00
and then running eqPly from the examples. As discussed in the meeting, this issue is assigned to Marwan.
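To make the failing condition concrete, here is a minimal sketch of the topology described above: three GPUs, two CPUs, and an allocation containing only one CPU. It computes which GPUs have no local CPU inside the allocation, i.e. where the intersection is empty. All names (`CpuSet`, `usableCpus`, `gpusWithoutLocalCpu`) are hypothetical and the GPU and CPU numbering is assumed for illustration; this is not how Equalizer or hwloc actually represent the topology.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <vector>

using CpuSet = std::set<int>;

// CPUs that are both local to a GPU and part of the allocation.
CpuSet usableCpus(const CpuSet& gpuLocal, const CpuSet& allocated)
{
    CpuSet result;
    std::set_intersection(gpuLocal.begin(), gpuLocal.end(),
                          allocated.begin(), allocated.end(),
                          std::inserter(result, result.begin()));
    return result;
}

// GPUs for which the intersection above is empty: an automatic affinity
// lookup has nothing to bind to and needs an explicit fallback.
std::vector<int> gpusWithoutLocalCpu(const std::map<int, CpuSet>& gpuToCpus,
                                     const CpuSet& allocated)
{
    // Assumption: an allocation always contains at least one CPU.
    assert(!allocated.empty());

    std::vector<int> result;
    for (const auto& entry : gpuToCpus)
    {
        if (usableCpus(entry.second, allocated).empty())
            result.push_back(entry.first);
    }
    return result;
}
```

With GPUs 0 and 1 attached to CPU 0, GPU 2 attached to CPU 1, and only CPU 0 allocated, `gpusWithoutLocalCpu` returns just GPU 2. One possible way to handle such a GPU would be to run its pipe thread without affinity instead of trying to bind it, but whether that avoids the crash reported here is not confirmed by this thread.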