Closed SteveBronder closed 6 years ago
@syclik this was pretty big so I thought it would be a good idea to make it a side PR
This is a first go at making it into a singleton-ish object. It makes the startup of the OpenCL stuff less lazy in that we now set up the context eagerly, but still compile the kernels as they are needed.
I was staring at `stack_alloc` for inspiration, but I don't think my interpretation came out very pretty.
Also tagging @bob-carpenter. If you have time to look at what I'm cooking over here it would be very much appreciated!
@syclik I'm going to rewrite this using boost compute and will ping you when that's done
Maybe it'd help to talk about design first?
Yes absolutely. Let's chat tonight after the meetup
Boost Compute looks interesting and makes certain things easier, but I'm not sure it's really the best thing for us right now. The things that it makes easier:
Here's why I think it's not good for us:
Thoughts?
@SteveBronder, can you add `logic_error.hpp` to the PR?
How template-crazy is it? After dealing with Spirit Qi parser, more advanced Boost features scare me.
Not template-crazy. It's macro-crazy. A whole different type of crazy.
That said, the usage pattern is pretty clean in Boost Compute.
Another concern with Compute is that the last commit was Sep 11, 2017, and it looks like those builds failed.
It's pretty cool that it handles the GPU setup and has some other niceties, but we do that already and probably want the low level controls. Let's keep going with the current version.
@SteveBronder, I'm going to need your help. I'm walking through the setup code in depth (and rewriting it with tests). I'll check in my changes in a few, but I don't think we're configuring it right.
First thing we do is grab a platform. I'm ok assuming we have one platform on the host. I've verified that's what happens on my system. Is that right?
help!?: I'm walking through the next bit of logic where we grab all devices, check that size is not 0, then we choose the first device. I have two devices on my machine:
Why do we choose the first? What if I wanted to choose the second? Do we want to allow both? The question here is how to configure it. Before adding more code, I'd like to know how we address this.
I'm pushing my branch. I'm starting to add more comprehensive tests and they're easy to run (add `STAN_OPENCL=true` to `make/local`):
./runTests.py test/unit/math/gpu/opencl_context_test.cpp
@rok-cesnovar, can you give me write access to this repo? I can't push my changes.
@SteveBronder, also, both devices show the same work group size.
Is there a simple utility that lists all platforms and devices? If not, perhaps we can make one that will print one out. Then the user can set certain things in configuration before building?
A lot of these can be run-time decisions, but we don't really have a way of controlling them at run-time.
I'm in a big meeting today but will try to address this in the next hour or so.
> Is there a simple utility that lists all platforms and devices?

Look up the program `clinfo`, which should give you device and platform information.
> First thing we do is grab a platform. I'm ok assuming we have one platform on the host. I've verified that's what happens on my system. Is that right?

You can have more than one platform; we made it the first platform (and first device) for simplicity.
> help!?: I'm walking through the next bit of logic where we grab all devices, check that size is not 0, then we choose the first device. I have two devices on my machine:
>
> device name: Intel(R) HD Graphics 530
> device name: AMD Radeon Pro 460 Compute Engine

The 15 inch Mac has an integrated GPU in addition to the discrete one (from what I remember), so that is why you see two devices.
> Why do we choose the first? What if I wanted to choose the second? Do we want to allow both? The question here is how to configure it. Before adding more code, I'd like to know how we address this.

The choice of the first device is arbitrary. It was a placeholder until we sorted out how to grab multiple or specific devices for the context, and it needs to be updated.
> can you give me write access to this repo? I can't push my changes.

I think I can move the PR to be merged into develop (which should give you write access?).
> also, both devices show the same work group size.

That's reasonable.
No need to do anything in a hurry. I can keep at it without something fast.
How should we interface for the user to grab a particular device? Just a make var?
How does this work for other OpenCL applications?
I'll look into how tensorflow / xgboost handles this.
Putting this here for me to play with later. Handling multiple devices seems pretty simple: https://software.intel.com/en-us/node/540471
@syclik if you're cool with it I think it's better to delete the kernels from this PR and add them later with their appropriate tests. The dummy kernel is all we need at the moment to verify compilation can happen.
Sounds good!
Two things, @syclik:

I made a macro that can be passed through to grab the appropriate device for the user. But FYI, the code that finds the XML does not work when passing arguments to make through `./runTests.py`:
./runTests.py test/unit/math/gpu/opencl_context_test.cpp STAN_OPENCL=true OPENCL_DEVICE=0
------------------------------------------------------------
make -j1 OPENCL_DEVICE=0 test/unit/math/gpu/opencl_context_test STAN_OPENCL=true
clang++ -Wall -I . -isystem lib/eigen_3.3.3 -isystem lib/boost_1.65.1 -isystem lib/cvodes_2.9.0/include -isystem lib/idas-2.1.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -isystem lib/opencl_1.2.8 -DSTAN_OPENCL -DOPENCL_DEVICE=0 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -O3 -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -O3 -DNO_FPRINTF_OUTPUT -pipe -c -o test/unit/math/gpu/opencl_context_test.o test/unit/math/gpu/opencl_context_test.cpp
clang++ -Wall -I . -isystem lib/eigen_3.3.3 -isystem lib/boost_1.65.1 -isystem lib/cvodes_2.9.0/include -isystem lib/idas-2.1.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -isystem lib/opencl_1.2.8 -DSTAN_OPENCL -DOPENCL_DEVICE=0 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -O3 -DNO_FPRINTF_OUTPUT -pipe -lOpenCL -o test/unit/math/gpu/opencl_context_test test/unit/math/gpu/opencl_context_test.o lib/gtest_1.7.0/src/gtest_main.cc lib/gtest_1.7.0/src/gtest-all.o
------------------------------------------------------------
OPENCL_DEVICE=0 --gtest_output="xml:OPENCL_DEVICE=0.xml"
/bin/sh: 1: --gtest_output=xml:OPENCL_DEVICE=0.xml: not found
OPENCL_DEVICE=0 --gtest_output="xml:OPENCL_DEVICE=0.xml" failed
exit now (02/21/18 08:24:52 EST)
However, running `./test/unit/math/gpu/opencl_context_test` directly after `./runTests.py` shows the tests all passing. I think this is just something with how `./runTests.py` grabs the XML. We add those flags to `make/local` in Jenkins, so this is just an FYI for when you run it from the terminal.
@rok-cesnovar, has your team looked into using multiple devices? We could probably write a macro with the above to do something like how ViennaCL grabs multiple devices here.
I think I see what you've done. Should we just make that a make variable that's passed through `make/local` instead of passing it through the `./runTests.py` python script? I'm wary of tacking on additional complexity there (at the python level).
Yes, we also do that with `STAN_OPENCL`. The above is just an FYI if you compile locally.
That command you posted failed, no?
The test passes (at least on my local machine) once it compiles, but it can't find the XML, so `./runTests.py` fails.
If you run it with make it should compile and pass. On Jenkins we will pass `STAN_DEVICE` to `make/local`, so we don't do anything to `./runTests.py`.
That call with the XML file just looks wrong... I don't think runTests is designed to pass two flags.
Ah! Yes, that's probably it. Ignore the above; that's just a whoopsie on my part.
@syclik Are you okay with how the `opencl_context` is looking? If so I'll push this back into the original PR.
@SteveBronder, I've just reviewed the code again. Here's what we need to do:

- For the `opencl_context` constructor, we need to discuss. At some point, we said that the construction would be lightweight and that the initialization would be done on first use. This doesn't do that -- it's fine and it's a design decision. Let me know if that's intentional.
- For the `opencl_context` constructor, update documentation.
- For the `opencl_context` constructor, update tests.
- Remove the `getInstance()` method.
- Clean up the public members of the class. Is there any reason things like the typedefs `map_string`, `map_kernel`, or `map_bool` are in the public interface? Does `kernels_` need to be a public member? I don't see these things as being accessed outside of this class, so I'd make them private.

The overall goal is to get this into the math library. We're going through the exercise of turning prototype code into something that's acceptable to be distributed to all math users. That means all users of Stan, PyStan, CmdStan, and RStan.
The overall things we need to spend more time on are:

- implementation of the design. I think the overall design is right (modulo cleaning up a few things). We just need to implement it safely.
- helping future maintenance. One way to do that is a clean design (which we have). The second way is adding enough tests so we know how to instantiate it, debug it in the future, and prevent it from breaking as we move forward with the rest of the GPU features.
Which things on the list do you want to tackle? I can help with some of them too.
Thank you for the review!
- For the opencl_context constructor, we need to discuss. At some point, we said that the construction would be lightweight and that the initialization would be done on first use. This doesn't do that -- it's fine and it's a design decision. Let me know if that's intentional.
What are the cons of our current method? @rok-cesnovar, do you have a major preference for lazy initialization vs. what we have now (eager initialization for the `opencl_context` and lazy compilation of the kernels)?
I switched the initialization method because this seems to be the preferred way of setting up thread safe singletons in C++11. I based my approach off of the two links below:
https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems
https://stackoverflow.com/questions/34457432/c11-singleton-static-variable-is-thread-safe-why
If either of you think this is not the best or have a preferred style I'm all ears. My only concern with the current method is how this integrates with MPI.
- Remove the getInstance() method.
What would you prefer for initialization?
- Clean up the public members of the class. Is there any reason things like the typedefs map_string, map_kernel, or map_bool are in the public interface? Does kernels_ need to be a public member? I don't see these things as being accessed outside of this class, so I'd make it private.
I can move these to private.
The overall things we need to spend more time with this on is:
- implementation of the design. I think the overall design is right (modulo cleaning up a few things). We just need to implement it safely.
- helping future maintenance. One way to do that is a clean design (which we have). The second way to help this is by adding enough tests so we know how to instantiate it, debug it in the future, and prevent it from breaking as we move forward with the rest of the GPU features.
Yes I agree, I will write out tests for how to use (and break) all the public methods
Which things on the list do you want to tackle? I can help with some of them too.
For 1, if we need a full lazy initialization style to use GPUs across slaves with MPI, can you lay out what the design of that should look like? I can tackle 2:8, though I will be on a business trip till Thursday this week; if you have any spare time, feel free to tackle any of the above at your leisure.
First of all, thanks to both of you!
No, I do not have a major preference regarding the instant initialization. I think there is no argument against doing it instantly. For the kernel compilation, I think we should stay with the lazy initialization.
I will hopefully tackle 2:8 by Thursday, so just keep an eye on the changes when you get back.
I'm going to leave this as-is because I like how it looks. Is it not lightweight? You don't like the `getInstance()` function?
If you would like more docs or different information, let me know.
Do we need to remove this?
Cleaned this all up. I rewrote this so that instead of having a bunch of different maps we have two: one holding the kernel meta info, and the other just a map from kernel name -> compiled kernel.
Thanks! I've got it on my list of things to take care of soon. Thanks for making all the updates.
Just a quick note on `getInstance()`. There's a way to do it with a `getInstance()` method and there's a way to do it without. I think without is simpler, but I'll just outline the two different ways (the current implementation isn't one or the other):

Without `getInstance()`:

- There's a static instance: `static opencl_context opencl_context;`. When doing this, it's important to have the constructor be quick and not throw if possible. This is where it makes sense for the `opencl_context` to be in a state that's valid, but not fully initialized, with the initialization happening on first use.
- The `getInstance()` method is no longer needed at all. If you need the instance, you just grab the static instance `stan::math::opencl_context`.

With `getInstance()`:

- To get the `opencl_context`, the user does this at the point where they need it: `opencl_context opencl_context = stan::math::opencl_context::getInstance();`. Every time.
- Initialization happens when `getInstance()` is called.

So... I think we can just remove the `getInstance()` static method and we'll be fine.
Submission Checklist
./runTests.py test/unit
make cpplint
Summary:
Fixup / rewrite of OpenCL context
Intended Effect:
How to Verify:
Side Effects:
Documentation:
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: