flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

update hwloc packages for GPU support #1511

Closed garlick closed 6 years ago

garlick commented 6 years ago

According to @trws, hwloc-1.11.10 (and possibly some earlier versions), when built with cuda support, will include GPU info in the XML it generates. flux-sched can then read this from the KVS and use it to schedule GPUs.
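
As a rough illustration of what "GPU info in the XML" could look like to a consumer, the sketch below scans an hwloc XML export for coprocessor OS devices. The sample document is hand-written and heavily trimmed, not real output, and the element and attribute names (`object`, `type="OSDev"`, `info name="CoProcType"`) are an assumption about the hwloc 1.11 XML layout rather than something confirmed in this issue. `find_coprocs` is a hypothetical helper, not part of flux-core or flux-sched.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; NOT real lstopo output.
# Assumed layout: GPUs appear as <object type="OSDev"> elements
# carrying an <info name="CoProcType" .../> child.
SAMPLE_XML = """<topology>
  <object type="Machine" os_index="0">
    <object type="PCIDev" os_index="16384">
      <object type="OSDev" name="cuda0" osdev_type="5">
        <info name="CoProcType" value="CUDA"/>
      </object>
    </object>
    <object type="PCIDev" os_index="20480">
      <object type="OSDev" name="eth0" osdev_type="2"/>
    </object>
  </object>
</topology>"""


def find_coprocs(xml_text):
    """Return (name, coproc_type) for every OSDev tagged with CoProcType."""
    root = ET.fromstring(xml_text)
    found = []
    for obj in root.iter("object"):
        if obj.get("type") != "OSDev":
            continue
        # Only direct <info> children describe this object.
        for info in obj.findall("info"):
            if info.get("name") == "CoProcType":
                found.append((obj.get("name"), info.get("value")))
    return found


if __name__ == "__main__":
    for name, kind in find_coprocs(SAMPLE_XML):
        print(f"{name}: {kind}")
```

If an hwloc built without cuda support produces XML with no such entries, a check like this would simply return an empty list, which is one quick way to tell the two builds apart.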

Current TOSS 3 hwloc (checked pascal vis system) is 1.11.2.

Updating to 1.11.10 seems like no problem, but building with cuda support would require it to depend on one of the cudatoolkit versions packaged in /opt/cudatoolkit with environment-modules. I'm not sure if this is a show stopper; e.g., does it mean that hwloc and flux-core would then have to move under the /opt regime (i.e., load the flux-core environment module, which depends on the hwloc one, which depends on the cuda one)?

garlick commented 6 years ago

Update from @trws: 1.11.9 with cuda doesn't seem to enumerate the GPU devices, on sierra at least.

grondo commented 6 years ago

Is this a case where we might need a subpackage for a module that overrides our built-in hwloc?

I'd also hate to keep bumping our "required" version such that flux-core can't build on any modern distro without pulling down and rebuilding hwloc, even though the system may not have any fancy GPU coprocessors.

trws commented 6 years ago

That might be a better solution. Flux itself doesn't require anything newer than 1.11.1, and the older versions of 1.11 (and even much further back, actually; I used some of this more than 3 years ago) work even for coprocessors, just not on sierra apparently. In fact, just to make sure, I ran the 1.11.9 version that didn't work on sierra on Ray just now, and it enumerates everything just fine. So only sierra really cares that it's that new, but the other cuda machines will care that it's built with cuda support.

grondo commented 6 years ago

Pinned to 0.10.0 milestone.

I'm still a little unclear on why we can't build an hwloc with CUDA + OpenCL support in /usr/tce (or /opt), then module load that version before flux start and have our resource-hwloc module do the right thing.

Is there really a binary incompatibility between hwloc 1.11 built with and without CUDA support?

Sorry if I'm being dense.

dongahn commented 6 years ago

@grondo: I thought the plan was actually to bundle my hwloc with CUDA + OpenCL in /usr/tcetmp and do module load. I will follow up.

dongahn commented 6 years ago

Also FYI -- unfortunately, we will need to use LD_PRELOAD in the module instead of LD_LIBRARY_PATH because jsrun prepends its own libraries, including its hwloc, to LD_LIBRARY_PATH.
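
To make the ordering problem concrete, here is a toy model of the lookup under loud simplifying assumptions: it ignores RPATH/RUNPATH, the ld.so cache, and symbol-level interposition, and only captures that directories earlier in `LD_LIBRARY_PATH` win while a preloaded library is consulted before any of them. All directory names are made up for illustration, and `libhwloc.so.5` is assumed to be the soname in question.

```python
import os
import tempfile


def pick_library(soname, ld_library_path, ld_preload=""):
    """Toy model: return the path that would supply `soname`.

    A preloaded file matching the soname wins; otherwise the first
    LD_LIBRARY_PATH directory containing the file wins.
    """
    for path in ld_preload.split():
        if os.path.basename(path) == soname and os.path.isfile(path):
            return path
    for directory in ld_library_path.split(":"):
        candidate = os.path.join(directory, soname)
        if directory and os.path.isfile(candidate):
            return candidate
    return None


def demo():
    with tempfile.TemporaryDirectory() as top:
        # Hypothetical layout: one hwloc shipped by the launcher,
        # one CUDA-enabled hwloc provided by a module.
        launcher = os.path.join(top, "launcher", "lib")
        module = os.path.join(top, "hwloc-cuda", "lib")
        for d in (launcher, module):
            os.makedirs(d)
            open(os.path.join(d, "libhwloc.so.5"), "w").close()

        # The module adds its directory, then the launcher prepends its own.
        search = ":".join([launcher, module])
        by_path = pick_library("libhwloc.so.5", search)
        by_preload = pick_library(
            "libhwloc.so.5", search,
            ld_preload=os.path.join(module, "libhwloc.so.5"))
        return (os.path.relpath(by_path, top),
                os.path.relpath(by_preload, top))


if __name__ == "__main__":
    print(demo())
```

In this model the search-path lookup resolves to the launcher's copy even though the module's directory is on the path, while the preload case resolves to the module's copy regardless of what was prepended.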

grondo commented 6 years ago

Ah, maybe that is where I was confused. Sorry!

trws commented 6 years ago

There is not a binary incompatibility, as far as I know, but it would avoid having to LD_PRELOAD the thing. John tells me he has not installed one, but if I give him commands or a tarball he can put it up on sierra/butte etc. quickly.

dongahn commented 6 years ago

@trws: Roy Mussleman is also across my hallway and I will work with him to create a module off of my /usr/global build.

trws commented 6 years ago

Does that have opencl support in it as well?

dongahn commented 6 years ago

I will need to rebuild it w/ --enable-opencl before working on the module.

dongahn commented 6 years ago

```
rzansel61{dahn}24: ../configure --prefix=/collab/usr/global/tools/hwloc/blueos_3_ppc64le_ib/hwloc-1.11.10-cuda --enable-cuda --enable-opencl CPPFLAGS=-I/usr/tce/packages/cuda/cuda-9.2.88/include LDFLAGS=-L/usr/tce/packages/cuda/cuda-9.2.88/lib64

<CUT>

configure: WARNING: Specified --enable-opencl switch, but could not
configure: WARNING: find appropriate support
configure: error: Cannot continue
```

So what am I missing? It configures ok with only `--enable-cuda` given.

dongahn commented 6 years ago

#1568: Packaging up an hwloc won't necessarily solve this problem. flux has issues with LD_PRELOAD=cuda-enabled-hwloc.

trws commented 6 years ago

Odd, the opencl library is part of the cuda driver package, so I would have expected it to be found by default. Maybe it isn’t in the usual default paths?
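
One way to narrow this down is to check by hand for the two things an autoconf-style OpenCL probe typically needs: a header under some `-I` directory and a linkable library under some `-L` directory. The sketch below does that. The specific file names (`CL/cl_ext.h`, `libOpenCL.so`) are an assumption about what hwloc's configure looks for and should be confirmed against `config.log`; `probe_opencl` and `flag_dirs` are hypothetical helpers. Note that a directory holding only a versioned file such as `libOpenCL.so.1` would not satisfy `-lOpenCL` at link time.

```python
import os
import shlex


def flag_dirs(flags, prefix):
    """Extract directories from a flags string, e.g. '-I/a -I/b' -> ['/a', '/b']."""
    return [tok[len(prefix):] for tok in shlex.split(flags)
            if tok.startswith(prefix) and len(tok) > len(prefix)]


def probe_opencl(cppflags, ldflags,
                 header="CL/cl_ext.h", libname="libOpenCL.so"):
    """Report where the assumed header and unversioned library are found.

    Returns a dict with the first matching directory for each, or None.
    Default compiler/linker search paths are NOT considered here.
    """
    result = {"header": None, "library": None}
    for d in flag_dirs(cppflags, "-I"):
        if os.path.isfile(os.path.join(d, header)):
            result["header"] = d
            break
    for d in flag_dirs(ldflags, "-L"):
        if os.path.isfile(os.path.join(d, libname)):
            result["library"] = d
            break
    return result


if __name__ == "__main__":
    # Flags copied from the configure invocation in this issue.
    print(probe_opencl(
        "-I/usr/tce/packages/cuda/cuda-9.2.88/include",
        "-L/usr/tce/packages/cuda/cuda-9.2.88/lib64"))
```

A `None` for either entry would suggest which half of the probe is failing, though the authoritative answer is whatever test `config.log` records as having failed.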

dongahn commented 6 years ago

As mentioned in Issue #1568, I created a /usr/tcetmp module: hwloc/1.11.10-cuda

dongahn commented 6 years ago

I learned from Max that /usr/lib64/nvidia has the OpenCL libraries, but I still hit the same issue:

rzansel61{dahn}40: ../configure --prefix=/collab/usr/global/tools/hwloc/blueos_3_ppc64le_ib/hwloc-1.11.10-cuda --enable-cuda --enable-opencl CPPFLAGS=-I/usr/tce/packages/cuda/cuda-9.2.88/include LDFLAGS="-L/usr/tce/packages/cuda/cuda-9.2.88/lib64 -L/usr/lib64/nvidia"

configure: WARNING: Specified --enable-opencl switch, but could not
configure: WARNING: find appropriate support
configure: error: Cannot continue

At this point, I am a bit reluctant to dive further into this since I don't have any immediate need for OpenCL. @trws: will it be okay to close this for now? We will surely revisit OpenCL when we are dealing with AMD ROCm.

trws commented 6 years ago

Works for me.
