calderpg-tri commented 3 years ago

In service of #14431 we would like to add support for OpenCL (cross-platform GPU acceleration) and OpenMP (directive-based parallelization). Both of these dependencies add runtime components, which may be more (OpenMP) or less (OpenCL) problematic for Drake users.

OpenCL

#14843 adds OpenCL support for Ubuntu and Mac, used by a test in external `voxelized_geometry_tools`. We expect in the future that OpenCL will be used as part of planning code moved to Drake and thus be shipped in some/all binary forms of Drake. Broadly, our OpenCL uses the Installable Client Driver mechanism, by which our code links to the ICD loader and at runtime enumerates the available OpenCL platforms and devices. If no OpenCL platform/device is available our code will fall back to a different implementation, and thus Drake will not require OpenCL execution to be available.
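A minimal sketch of that fallback decision, not our actual code: `OpenClDevice`, `Backend`, and `SelectBackend` are names made up for illustration, and the `enumerate` callback stands in for the real query through the ICD loader (in practice a wrapper around `clGetPlatformIDs` and `clGetDeviceIDs`). Passing the enumeration in as a callback keeps the sketch buildable on machines without OpenCL headers.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical record for one OpenCL device discovered at runtime.
struct OpenClDevice {
  std::string platform_name;
  std::string device_name;
};

// Hypothetical set of implementations the calling code can choose between.
enum class Backend { kOpenCl, kCpuFallback };

// Picks the OpenCL implementation only when at least one device was found.
// `enumerate` is an assumption of this sketch: it stands in for the real
// platform/device enumeration performed through the ICD loader.
Backend SelectBackend(
    const std::function<std::vector<OpenClDevice>()>& enumerate) {
  const std::vector<OpenClDevice> devices = enumerate();
  return devices.empty() ? Backend::kCpuFallback : Backend::kOpenCl;
}
```

With an enumerator that returns an empty list, `SelectBackend` returns `Backend::kCpuFallback`, which is the case that keeps OpenCL execution optional.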

Concerns/risks:

We believe that the runtime element should be minimal in the case of code that doesn't use OpenCL and that it shouldn't conflict with other software users want to integrate with Drake, but have not confirmed this yet (and doing so will require some feedback from the community).
The OpenCL execution model means that kernels are not compiled until run by a specific platform. So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations, we will not necessarily require that Drake CI support OpenCL execution (i.e. we don't need instances with GPUs). This could change in the future.
Apple has officially deprecated both OpenGL and OpenCL; however support for both continues to be available on Big Sur. Should this change in the future, we will need to remove support for OpenCL on Mac and potentially add additional Mac-specific implementation(s) of our planning tools.

OpenMP

OpenMP requires both compiler support and a runtime component. On Ubuntu platforms this is quite easy to integrate with a set of compile and link flags (although these flags differ somewhat between GCC and Clang). However, Apple does not provide the runtime library and partially disables OpenMP support in their compiler. At the very least, OpenMP support must be opt-out; whether it should instead be opt-in is an open question.
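To illustrate what opt-out can look like at the source level, here is a small sketch (the function names are hypothetical, not Drake code). It relies on the `_OPENMP` macro, which the compiler defines only when OpenMP is enabled (for example with `-fopenmp` on GCC), so the same file builds and gives the same answer either way.

```cpp
#include <cstddef>
#include <vector>

// Sums the squares of `values`. When the translation unit is compiled with
// OpenMP enabled, the loop is split across threads; otherwise the pragma is
// skipped and the loop runs serially.
double SumOfSquares(const std::vector<double>& values) {
  double total = 0.0;
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(values.size());
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    total += values[static_cast<std::size_t>(i)] *
             values[static_cast<std::size_t>(i)];
  }
  return total;
}

// Reports whether this translation unit was built with OpenMP enabled.
bool BuiltWithOpenMp() {
#ifdef _OPENMP
  return true;
#else
  return false;
#endif
}
```

Guarding the pragma with `#ifdef _OPENMP` also avoids unknown-pragma warnings in builds where OpenMP is turned off.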

Concerns/risks

OpenMP directives in our code interact with Eigen's own OpenMP integration. Conservatively, safe combination relies on the use of the EIGEN_DONT_PARALLELIZE define to disable Eigen's built-in uses.
OpenMP may interact or conflict with commercial solvers such as Gurobi and Mosek. We use OpenMP with Snopt internally and have patched interaction issues that arose, but have not extensively used it with the other commercial solvers. Mosek uses Cilk, which shouldn't directly conflict with OpenMP in terms of shared memory, but will definitely cause some sort of resource contention in the case someone puts a call to Mosek in the body of a #pragma omp parallel for loop.
If we want to add Mac support, doing so would require either a different compiler (i.e. GCC or upstream Clang from homebrew) or the use of the -Xclang option to Apple's compiler and a separately-provided release-specific version of the OpenMP runtime library.

CI and release implications

@jwnimmer-tri has enumerated some of the support matrix we'll need to consider, accounting for user channel and build options

User channel:

Source build (nightly, monthly)
GitHub binary tarball (nightly, monthly)
Homebrew binary cask (monthly)
Docker binary image (nightly, monthly)
Debian PPA binary w/sources (monthly)
Colab notebooks, likely via Debian PPA (monthly)

Build configs:

Gurobi on/off -- must be off for first-party binaries
Mosek on/off -- must be off for first-party binaries
Snopt on/off -- n.b. our first-party binaries turn this on, shrouded
Debug / Release / Coverage / Dynamic Analysis
Clang / GCC
Bionic / Focal / Catalina / Big Sur
OpenMP on/off
OpenCL on/off

We need to decide which channels will either support (or require) the various build option permutations and what coverage must exist in CI. I am putting together a survey to gather feedback on which combinations of channel/build should be supported and tested.
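For a sense of scale, the build configs listed above multiply out to 2 x 2 x 2 x 4 x 2 x 4 x 2 x 2 = 1024 permutations per channel. The sketch below just transcribes that list as data and counts it; the `Option` struct and function names are made up for illustration, and the result is an upper bound because some combinations are ruled out (for example, Gurobi and Mosek must be off in first-party binaries).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: one build option and the values it can take.
struct Option {
  std::string name;
  std::vector<std::string> values;
};

// The build configs from the list above, transcribed as data.
std::vector<Option> BuildOptions() {
  return {
      {"Gurobi", {"on", "off"}},
      {"Mosek", {"on", "off"}},
      {"Snopt", {"on", "off"}},
      {"Build type", {"Debug", "Release", "Coverage", "Dynamic Analysis"}},
      {"Compiler", {"Clang", "GCC"}},
      {"Platform", {"Bionic", "Focal", "Catalina", "Big Sur"}},
      {"OpenMP", {"on", "off"}},
      {"OpenCL", {"on", "off"}},
  };
}

// Counts the unconstrained permutations: the product of the number of values
// each option can take.
std::size_t CountPermutations(const std::vector<Option>& options) {
  std::size_t count = 1;
  for (const Option& option : options) {
    count *= option.values.size();
  }
  return count;
}
```

`CountPermutations(BuildOptions())` evaluates to 1024, which is why we need to pick a much smaller supported subset per channel.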

cc @ggould-tri @jwnimmer-tri @jamiesnape @sherm1

EricCousineau-TRI commented 3 years ago

Moved from PR:

[...] but a quick sanity check probably is still worthwhile. (If it does have downsides, we might need the option to disable it.)

Perhaps OpenCV should be part of the checklist? (From ~brief~ shallow investigations here, I think it enables OpenCL by default; dunno about static vs. dynamic linking: repro/.../opencv_cvtcolor_slow)

calderpg-tri commented 3 years ago

From a basic test program that uses OpenCV and OpenCL, I don't see any problems combining OpenCV-with-OpenCL-enabled and OpenCL.

calderpg-tri commented 3 years ago

So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations

To elaborate, right now I manually test changes to the OpenCL implementations against Nvidia, AMD, and Intel platforms. Nvidia and AMD platforms are amenable to testing through AWS via something like G4ad (AMD) and G4dn (Nvidia) instances, but I'm not aware of any instances that use Intel GPUs.

jamiesnape commented 3 years ago

So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations

To elaborate, right now I manually test changes to the OpenCL implementations against Nvidia, AMD, and Intel platforms. Nvidia and AMD platforms are amenable to testing through AWS via something like G4ad (AMD) and G4dn (Nvidia) instances, but I'm not aware of any instances that use Intel GPUs.

Factoring in hardware and software revisions, we are nearing an intractable number of implementations. If we can support two implementations of a given version of OpenCL we are probably doing well. Realistically, the Intel version would be at the bottom of my heap of versions to test. Budget is going to play into what we test too. We have some slack, but there are only so many G4 variant instances we could run in a weekly cycle (I would prioritize NVIDIA, not least because they own ARM).

calderpg-tri commented 3 years ago

Factoring in hardware and software revisions, we are nearing an intractable number of implementations. If we can support two implementations of a given version of OpenCL we are probably doing well.

I don't think we need to plan around testing on a range of (hardware x software) revisions - the most important part of testing on multiple platforms is to confirm that the OpenCL kernels build and something platform/implementation-specific doesn't sneak in. I think we can achieve that fine with a single example each of AMD and Nvidia.

Realistically, the Intel version would be at the bottom of my heap of versions to test.

Unfortunately, it's quite possible that this is the most-used implementation due to laptops. That said, I've only run into an Intel implementation-specific issue once (an ambiguous call to sqrt), so I think for now we could require that the rare changes to OpenCL kernels get manually tested on Intel instead.

jamiesnape commented 3 years ago

I don't think we need to plan around testing on a range of (hardware x software) revisions...

Yes, I just wouldn't want anyone to get a false sense of security from a given AWS instance type. OpenGL is hard enough, let alone OpenCL.

Unfortunately, it's quite possible that this is the most-used implementation due to laptops.

True, but they probably have the least to gain from using OpenCL?

calderpg-tri commented 3 years ago

Unfortunately, it's quite possible that this is the most-used implementation due to laptops.

True, but they probably have the least to gain from using OpenCL?

I have seen pretty solid speedups on NUCs and laptops for pointcloud voxelization and roadmap updating with OpenCL, especially on machines with fewer cores.

jamiesnape commented 3 years ago

Cool, nice to be proven wrong. Are there good gains with both the GPU and CPU implementations?

calderpg-tri commented 3 years ago

Are there good gains with both the GPU and CPU implementations?

I haven't tried them against Intel's OpenCL-on-CPU implementation, only their two GPU implementations (older beignet and newer NEO/GCR) if that's what you're asking.

jamiesnape commented 3 years ago

Yes. FWIW That may be a configuration we can handle on AWS.

jwnimmer-tri commented 2 years ago

\CC @xuchenhan-tri FYI as this might relate to FEM simulations in the future as well.

DamrongGuoy commented 2 years ago

I can see this is a big change, but I believe it will open Drake to new fruitful opportunities. Cheers!

jwnimmer-tri commented 2 years ago

A few more notes from my digging...

For users who might use MKL's libblas (instead of Ubuntu libblas) at load-time, it seems like the obvious and good things will happen by default, and we can rely on OpenMP to sort out the details, per the MKL Developer Guide.

Mosek currently uses Cilk for the thread pool, but

Mosek version 10 will no longer employ Cilk but most likely oneTBB. This will allow for a more fine grained control on threading. -- https://groups.google.com/g/mosek/c/x2pZnW0OJEo

For background docs and good tips, see:

When solving, possibly we should detect if we're within a parallel section (per omp_in_parallel) and then set MSK_IPAR_INTPNT_MULTI_THREAD to OFF automatically, or maybe we should just document the caveat and let users configure what they need. Maybe in MOSEK 10 it will be easier.
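A rough sketch of the detection half of that idea, assuming nothing about the solver wrapper itself: `InParallelRegion` and `AllowSolverMultithreading` are hypothetical helper names, and the code deliberately makes no Mosek calls. The caller would map a `false` result to setting MSK_IPAR_INTPNT_MULTI_THREAD to OFF. Note that `omp_in_parallel()` reports true only inside an active parallel region, so a region running on a single thread does not count.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns true when called from inside an active OpenMP parallel region.
// In builds without OpenMP there is no such region, so this is always false.
bool InParallelRegion() {
#ifdef _OPENMP
  return omp_in_parallel() != 0;
#else
  return false;
#endif
}

// Hypothetical policy helper: let the solver use its own thread pool only
// when the caller is not already running inside a parallel region.
bool AllowSolverMultithreading() {
  return !InParallelRegion();
}
```

Called from ordinary serial code, `AllowSolverMultithreading()` returns true, so existing single-threaded callers would see no change in solver behavior.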

Gurobi also consumes all threads on the machine by default: https://www.gurobi.com/documentation/9.5/refman/threads.html

I haven't yet found what kind of thread pool it's using.

RobotLocomotion / drake

Identify feature/options necessary to support OpenCL and OpenMP #14858
