RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Identify feature/options necessary to support OpenCL and OpenMP #14858

Open calderpg-tri opened 3 years ago

calderpg-tri commented 3 years ago

In service of #14431 we would like to add support for OpenCL (cross-platform GPU acceleration) and OpenMP (directive-based parallelization). Both of these dependencies add runtime components, which may be more (OpenMP) or less (OpenCL) problematic for Drake users.

OpenCL

#14843 adds OpenCL support for Ubuntu and Mac, used by a test in external voxelized_geometry_tools. We expect that in the future OpenCL will be used as part of planning code moved to Drake and thus be shipped in some/all binary forms of Drake. Broadly, our OpenCL use relies on the Installable Client Driver (ICD) mechanism, by which our code links to the ICD loader and at runtime enumerates the available OpenCL platforms and devices. If no OpenCL platform/device is available, our code will fall back to a different implementation, and thus Drake will not require OpenCL execution to be available.
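As a rough illustration of the fallback behavior described above, here is a minimal sketch. It is not the actual voxelized_geometry_tools or Drake code: `DeviceInfo`, `DeviceEnumerator`, `Backend`, and `SelectBackend` are hypothetical names, and the enumerator is injected so the sketch compiles without OpenCL headers. A real enumerator would presumably wrap the ICD loader calls `clGetPlatformIDs` and `clGetDeviceIDs`.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one OpenCL device found at runtime.
struct DeviceInfo {
  std::string platform_name;
  std::string device_name;
};

// Hypothetical enumerator type. A real one would wrap clGetPlatformIDs and
// clGetDeviceIDs and return an empty list when nothing is available.
using DeviceEnumerator = std::function<std::vector<DeviceInfo>()>;

enum class Backend { kOpenCL, kFallback };

// Selects the OpenCL backend only when at least one device was enumerated;
// otherwise selects the fallback, so OpenCL is never a hard requirement.
Backend SelectBackend(const DeviceEnumerator& enumerate) {
  std::vector<DeviceInfo> devices;
  if (enumerate) {
    devices = enumerate();
  }
  return devices.empty() ? Backend::kFallback : Backend::kOpenCL;
}
```

Injecting the enumerator keeps the selection logic testable on machines with no OpenCL platform installed, which is the case the fallback exists for.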

Concerns/risks:

OpenMP

OpenMP requires both compiler support and a runtime component. On Ubuntu platforms this is quite easy to integrate with a set of compile and link flags (although these flags differ somewhat between GCC and Clang). However, Apple does not provide the runtime library and partially disables OpenMP support in their compiler. At the very least, it must be possible to opt out of OpenMP support; whether it should instead be opt-in is an open question.
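To make the opt-out idea concrete, here is a minimal sketch of code that builds with or without OpenMP. It relies only on the standard `_OPENMP` macro, which compilers define when OpenMP is enabled (for example with `-fopenmp` on GCC). The function names `Sum` and `BuiltWithOpenMP` are hypothetical and do not come from Drake.

```cpp
#include <cstddef>
#include <vector>

// Sums a vector. When OpenMP is enabled at compile time the loop is
// parallelized with a reduction; otherwise it runs serially.
double Sum(const std::vector<double>& values) {
  double total = 0.0;
  const long n = static_cast<long>(values.size());
#if defined(_OPENMP)
#pragma omp parallel for reduction(+ : total)
#endif
  for (long i = 0; i < n; ++i) {
    total += values[static_cast<std::size_t>(i)];
  }
  return total;
}

// Reports whether this translation unit was compiled with OpenMP enabled.
constexpr bool BuiltWithOpenMP() {
#if defined(_OPENMP)
  return true;
#else
  return false;
#endif
}
```

Because the serial path is the same loop without the directive, a build that omits the OpenMP flags needs neither the compiler support nor the runtime library.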

Concerns/risks

CI and release implications

@jwnimmer-tri has enumerated some of the support matrix we'll need to consider, accounting for user channel and build options

User channel:

Build configs:

We need to decide which channels will support (or require) the various build option permutations and what coverage must exist in CI. I am putting together a survey to gather feedback on which combinations of channel/build should be supported and tested.

cc @ggould-tri @jwnimmer-tri @jamiesnape @sherm1

EricCousineau-TRI commented 3 years ago

Moved from PR:

[...] but a quick sanity check probably is still worthwhile. (If it does have downsides, we might need the option to disable it.)

Perhaps OpenCV should be part of the checklist? (From shallow investigations here, I think it enables OpenCL by default; dunno about static vs. dynamic linking: repro/.../opencv_cvtcolor_slow)

calderpg-tri commented 3 years ago

From a basic test program that uses OpenCV and OpenCL, I don't see any problems combining OpenCV-with-OpenCL-enabled and OpenCL.

calderpg-tri commented 3 years ago

> So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations

To elaborate, right now I manually test changes to the OpenCL implementations against Nvidia, AMD, and Intel platforms. Nvidia and AMD platforms are amenable to testing through AWS via something like G4ad (AMD) and G4dn (Nvidia) instances, but I'm not aware of any instances that use Intel GPUs.

jamiesnape commented 3 years ago

> So long as planning tools and externals are tested outside of Drake against a number of OpenCL implementations
>
> To elaborate, right now I manually test changes to the OpenCL implementations against Nvidia, AMD, and Intel platforms. Nvidia and AMD platforms are amenable to testing through AWS via something like G4ad (AMD) and G4dn (Nvidia) instances, but I'm not aware of any instances that use Intel GPUs.

Factoring in hardware and software revisions, we are nearing an intractable number of implementations. If we can support two implementations of a given version of OpenCL we are probably doing well. Realistically, the Intel version would be at the bottom of my heap of versions to test. Budget is going to play into what we test too. We have some slack, but there are only so many G4 variant instances we could run in a weekly cycle (I would prioritize NVIDIA, not least because they own ARM).

calderpg-tri commented 3 years ago

> Factoring in hardware and software revisions, we are nearing an intractable number of implementations. If we can support two implementations of a given version of OpenCL we are probably doing well.

I don't think we need to plan around testing on a range of (hardware x software) revisions - the most important part of testing on multiple platforms is to confirm that the OpenCL kernels build and something platform/implementation-specific doesn't sneak in. I think we can achieve that fine with a single example each of AMD and Nvidia.

> Realistically, the Intel version would be at the bottom of my heap of versions to test.

Unfortunately, it's quite possible that this is the most-used implementation due to laptops. That said, I've only run into an Intel implementation-specific issue once (an ambiguous call to sqrt), so I think for now we could require that the rare changes to OpenCL kernels get manually tested on Intel instead.

jamiesnape commented 3 years ago

> I don't think we need to plan around testing on a range of (hardware x software) revisions...

Yes, I just wouldn't want anyone to get a false sense of security from a given AWS instance type. OpenGL is hard enough, let alone OpenCL.

> Unfortunately, it's quite possible that this is the most-used implementation due to laptops.

True, but they probably have the least to gain from using OpenCL?

calderpg-tri commented 3 years ago

> Unfortunately, it's quite possible that this is the most-used implementation due to laptops.
>
> True, but they probably have the least to gain from using OpenCL?

I have seen pretty solid speedups on NUCs and laptops for pointcloud voxelization and roadmap updating with OpenCL, especially on machines with fewer cores.

jamiesnape commented 3 years ago

Cool, nice to be proven wrong. Are there good gains with both the GPU and CPU implementations?

calderpg-tri commented 3 years ago

> Are there good gains with both the GPU and CPU implementations?

I haven't tried them against Intel's OpenCL-on-CPU implementation, only their two GPU implementations (older beignet and newer NEO/GCR) if that's what you're asking.

jamiesnape commented 3 years ago

Yes. FWIW That may be a configuration we can handle on AWS.

jwnimmer-tri commented 2 years ago

\CC @xuchenhan-tri FYI as this might relate to FEM simulations in the future as well.

DamrongGuoy commented 2 years ago

I can see this is a big change, but I believe it will open Drake to new fruitful opportunities. Cheers!

jwnimmer-tri commented 2 years ago

A few more notes from my digging...

For users who might use MKL's libblas (instead of Ubuntu libblas) at load-time, it seems like the obvious and good things will happen by default, and we can rely on OpenMP to sort out the details, per the MKL Developer Guide.


Mosek currently uses Cilk for the thread pool, but

> Mosek version 10 will no longer employ Cilk but most likely oneTBB. This will allow for a more fine grained control on threading. -- https://groups.google.com/g/mosek/c/x2pZnW0OJEo

For background docs and good tips, see:

When solving, possibly we should detect if we're within a parallel section (per omp_in_parallel) and then set MSK_IPAR_INTPNT_MULTI_THREAD to OFF automatically, or maybe we should just document the caveat and let users configure what they need. Maybe in MOSEK 10 it will be easier.
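A minimal sketch of the detection idea, assuming only the standard OpenMP runtime call `omp_in_parallel()`. The helper names `InParallelRegion` and `SolverMayUseOwnThreads` are hypothetical, and the MOSEK call that would actually write the result to MSK_IPAR_INTPNT_MULTI_THREAD is deliberately left out.

```cpp
#if defined(_OPENMP)
#include <omp.h>
#endif

// Returns true when the caller is already inside an active OpenMP parallel
// region. Without OpenMP compiled in, there is never such a region.
bool InParallelRegion() {
#if defined(_OPENMP)
  return omp_in_parallel() != 0;
#else
  return false;
#endif
}

// Hypothetical policy helper: the solver may use its own threads only when
// the caller is not already running in parallel. The result would decide
// whether MSK_IPAR_INTPNT_MULTI_THREAD is set to ON or OFF (not shown).
bool SolverMayUseOwnThreads() { return !InParallelRegion(); }
```

This only covers parallelism that OpenMP itself knows about. A caller using another threading mechanism would not be detected, which may be one argument for documenting the caveat instead.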


Gurobi also consumes all threads on the machine by default: https://www.gurobi.com/documentation/9.5/refman/threads.html

See also: https://support.gurobi.com/hc/en-us/community/posts/360055837711-Solving-different-models-in-parallel-C-OpenMP-

I haven't yet found what kind of thread pool it's using.
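One way to avoid oversubscription when several models are solved in parallel is to give each solve a share of the machine's threads. The sketch below only computes that share. `PerSolveThreadBudget` is a hypothetical helper, and passing the result to Gurobi's Threads parameter (documented at the link above) is left to the caller.

```cpp
#include <algorithm>

// Divides the available threads evenly among concurrent solves, always
// granting at least one thread per solve. Non-positive inputs are clamped
// to 1, which also covers a thread count reported as 0 (unknown).
int PerSolveThreadBudget(int total_threads, int concurrent_solves) {
  const int total = std::max(total_threads, 1);
  const int solves = std::max(concurrent_solves, 1);
  return std::max(total / solves, 1);
}
```

For instance, with 8 threads and 4 concurrent solves each solve would be limited to 2 threads. Whether an even split is the right policy is an assumption of this sketch.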