Closed garlick closed 6 years ago
Update from @trws: 1.11.9 with cuda doesn't seem to enumerate the GPU devices, on sierra at least.
Is this a case where we might need a subpackage for a module that overrides our built-in hwloc?
I'd also hate to keep bumping our "required" version such that flux-core can't build on any modern distro without pulling down and rebuilding hwloc, even though the system may not have any fancy GPU coprocessors.
That might be a better solution. Flux itself doesn't require anything newer than 1.11.1, and the older 1.11 releases (and even much older ones; I used some of this more than 3 years ago) work even for coprocessors, just not on sierra apparently. In fact, just to make sure, I ran the same 1.11.9 build that didn't work on sierra on Ray just now, and it enumerates everything fine. So only sierra really cares that it's that new, but the other CUDA machines will care that it's built with CUDA support.
Pinned to 0.10.0 milestone.
I'm still a little unclear on why we can't build an hwloc with CUDA + OpenCL support in /usr/tce (or /opt), then `module load` that version before `flux start`, and have our resource-hwloc module do the right thing. Is there really a binary incompatibility between hwloc 1.11 built with and without CUDA support? Sorry if I'm being dense.
@grondo: I thought the plan was actually to bundle my hwloc with CUDA + OpenCL in /usr/tcetmp and do `module load`. I will follow up.
Also FYI: unfortunately, we will need to use `LD_PRELOAD` in the module instead of `LD_LIBRARY_PATH`, because jsrun prepends its libraries, including its hwloc, to `LD_LIBRARY_PATH`.
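A minimal sketch of what that module or wrapper could do, assuming a hypothetical install path for the CUDA-enabled hwloc (the path and soname below are illustrative, not the actual /usr/tcetmp layout). Prepending to `LD_PRELOAD` makes the dynamic loader bind our libhwloc before the one jsrun puts on `LD_LIBRARY_PATH`:

```shell
# Sketch only: prepend a CUDA-enabled libhwloc to LD_PRELOAD so it wins even
# though jsrun prepends its own hwloc to LD_LIBRARY_PATH. The install path
# below is illustrative.
preload_hwloc() {
    # Prepend $1 to LD_PRELOAD, preserving any existing entries.
    export LD_PRELOAD="$1${LD_PRELOAD:+:$LD_PRELOAD}"
}

preload_hwloc /usr/tcetmp/packages/hwloc/1.11.10-cuda/lib/libhwloc.so.5
echo "$LD_PRELOAD"
```

The `${LD_PRELOAD:+:...}` expansion avoids a dangling `:` when the variable starts out empty.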
Ah, maybe that is where I was confused. Sorry!
There is not a binary incompatibility, as far as I know, but it would avoid having to LD_PRELOAD the thing. John tells me he has not installed one, but if I give him commands or a tarball he can put it up on sierra/butte etc. quickly.
@trws: Roy Mussleman is also across my hallway and I will work with him to create a module off of my /usr/global build.
Does that have opencl support in it as well?
I will need to rebuild it with `--enable-opencl` before working on the module.
```
rzansel61{dahn}24: ../configure --prefix=/collab/usr/global/tools/hwloc/blueos_3_ppc64le_ib/hwloc-1.11.10-cuda --enable-cuda --enable-opencl CPPFLAGS=-I/usr/tce/packages/cuda/cuda-9.2.88/include LDFLAGS=-L/usr/tce/packages/cuda/cuda-9.2.88/lib64
<CUT>
configure: WARNING: Specified --enable-opencl switch, but could not
configure: WARNING: find appropriate support
configure: error: Cannot continue
```
So what am I missing? It configures ok with only `--enable-cuda` given.
Odd, the OpenCL library is part of the CUDA driver package, so I would have expected it to be found by default. Maybe it isn't in the usual default paths?
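One thing worth checking (a guess, not confirmed in this thread): configure's `--enable-opencl` probe does a link test against `-lOpenCL`, and driver packages often ship only the versioned `libOpenCL.so.1` without the unversioned `libOpenCL.so` dev symlink the linker needs. A small sketch for checking a directory (the path is illustrative):

```shell
# Sketch: check whether a directory provides the unversioned libOpenCL.so that
# a "-lOpenCL" link test requires; driver packages often install only
# libOpenCL.so.1. The directory argument is illustrative.
has_opencl_devlink() {
    [ -e "$1/libOpenCL.so" ]
}

if has_opencl_devlink /usr/lib64/nvidia; then
    echo "libOpenCL.so found: -lOpenCL should link"
else
    echo "no unversioned libOpenCL.so: the OpenCL link test will fail"
fi
```

If that turns out to be the cause, a local symlink directory passed via `LDFLAGS=-L...` would be one way around it without touching the system paths.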
As mentioned in Issue #1568, I created the /usr/tcetmp module: hwloc/1.11.10-cuda.
From Max, I found that /usr/lib64/nvidia has the OpenCL libraries. But I still hit the same issue:
```
rzansel61{dahn}40: ../configure --prefix=/collab/usr/global/tools/hwloc/blueos_3_ppc64le_ib/hwloc-1.11.10-cuda --enable-cuda --enable-opencl CPPFLAGS=-I/usr/tce/packages/cuda/cuda-9.2.88/include LDFLAGS="-L/usr/tce/packages/cuda/cuda-9.2.88/lib64 -L/usr/lib64/nvidia"
configure: WARNING: Specified --enable-opencl switch, but could not
configure: WARNING: find appropriate support
configure: error: Cannot continue
```
At this point, I am a bit reluctant to dive further into this since I don't have any immediate need for OpenCL. @trws: would it be okay to close this for now? We will surely revisit OpenCL when we are dealing with AMD ROCm.
Works for me.
According to @trws, hwloc 1.11.10 (and possibly some earlier versions), when built with CUDA support, will include GPU info in the XML it generates. flux-sched can then read this from the KVS and use it to schedule GPUs.
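As a rough illustration, a CUDA-enabled hwloc reports GPUs as OS-device objects (e.g. `cuda0`) in its XML, and a consumer can detect them by name. The XML line below is a hand-written stand-in for real lstopo output, not captured from an actual machine:

```shell
# Sketch: count CUDA OS-device objects in an hwloc XML fragment, as a
# scheduler reading the KVS-stored topology might. The XML here is a
# hand-written illustration, not real lstopo output.
xml='<object type="OSDev" name="cuda0"/>'

count_cuda_devs() {
    printf '%s\n' "$1" | grep -c 'name="cuda'
}

count_cuda_devs "$xml"
```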
Current TOSS 3 hwloc (checked pascal vis system) is 1.11.2.
Updating to 1.11.10 seems like no problem, but building with CUDA support would require it to depend on one of the cudatoolkit versions packaged in /opt/cudatoolkit with environment-modules. I'm not sure whether this is a show stopper, e.g. does it mean that hwloc and flux-core would then have to move under the /opt regime (load the flux-core environment module, which depends on the hwloc one, which depends on the cuda one)?