Closed SteveBronder closed 6 years ago
@syclik this was pretty big so I thought it would be a good idea to make it a side PR
This is a first go at making it into a singleton-ish object. It makes the startup of the OpenCL stuff less lazy in that we now set up the context eagerly, but still compile the kernels as they are needed.
I was staring at `stack_alloc` for inspiration, but I don't think my interpretation came out very pretty.
Also tagging @bob-carpenter. If you have time to look at what I'm cooking over here it would be very much appreciated!
@syclik I'm going to rewrite this using boost compute and will ping you when that's done
Maybe it'd help to talk about design first?
Yes absolutely. Let's chat tonight after the meetup
Boost Compute looks interesting and makes certain things easier, but I'm not sure it's really the best thing for us right now. The things that it makes easier:
Here's why I think it's not good for us:
Thoughts?
@SteveBronder, can you add `logic_error.hpp` to the PR?
How template-crazy is it? After dealing with Spirit Qi parser, more advanced Boost features scare me.
Not template-crazy. It's macro-crazy. A whole different type of crazy.
That said, the usage pattern is pretty clean in Boost Compute.
Another concern with Compute is that the last commit was Sep 11, 2017, and it looks like those builds failed.
It's pretty cool that it handles the GPU setup and has some other niceties, but we do that already and probably want the low level controls. Let's keep going with the current version.
@SteveBronder, I'm going to need your help. I'm walking through the setup code in depth (and rewriting it with tests). I'll check in my changes in a few, but I don't think we're configuring it right.
First thing we do is grab a platform. I'm ok assuming we have one platform on the host. I've verified that's what happens on my system. Is that right?
help!?: I'm walking through the next bit of logic where we grab all devices, check that size is not 0, then we choose the first device. I have two devices on my machine:
Why do we choose the first? What if I wanted to choose the second? Do we want to allow both? The question here is how to configure it. Before adding more code, I'd like to know how we address this.
I'm pushing my branch. I'm starting to add more comprehensive tests and they're easy to run (add `STAN_OPENCL=true` to `make/local`):
./runTests.py test/unit/math/gpu/opencl_context_test.cpp
@rok-cesnovar, can you give me write access to this repo? I can't push my changes.
@SteveBronder, also, both devices show the same work group size.
Is there a simple utility that lists all platforms and devices? If not, perhaps we can make one that will print one out. Then the user can set certain things in configuration before building?
A lot of these can be run-time decisions, but we don't really have a way of controlling them at run-time.
I'm in a big meeting today but will try to address this in the next hour or so.
> Is there a simple utility that lists all platforms and devices?

Look up the program `clinfo`, which should give you device and platform information.
> First thing we do is grab a platform. I'm ok assuming we have one platform on the host. I've verified that's what happens on my system. Is that right?

You can have more than one platform; we made it the first platform (and first device) for simplicity.
> help!?: I'm walking through the next bit of logic where we grab all devices, check that size is not 0, then we choose the first device. I have two devices on my machine:
>
> device name: Intel(R) HD Graphics 530
> device name: AMD Radeon Pro 460 Compute Engine

The 15 inch Mac has an integrated GPU in addition to the discrete one (from what I remember), so that is why you see two devices.
> Why do we choose the first? What if I wanted to choose the second? Do we want to allow both? The question here is how to configure it. Before adding more code, I'd like to know how we address this.

The choice of the first device is arbitrary. It was a placeholder until we sorted out how to grab multiple or specific devices for the context, and it needs to be updated.
> can you give me write access to this repo? I can't push my changes.

I think I can move the PR to be merged into develop (which should give you write access?).
> also, both devices show the same work group size.

That's reasonable.
No need to do anything in a hurry. I can keep at it without something fast.
How should we interface for the user to grab a particular device? Just a make var?
How does this work for other OpenCL applications?
I'll look into how tensorflow / xgboost handles this.
Putting this here for me to play with later. Handling multiple devices seems pretty simple: https://software.intel.com/en-us/node/540471
@syclik if you're cool with it I think it's better to delete the kernels from this PR and add them later with their appropriate tests. The dummy kernel is all we need at the moment to verify compilation can happen.
Sounds good!
Two things, @syclik:

I made a macro that can be passed through to grab the appropriate device for the user. But FYI, the code that finds the XML does not work when passing arguments to make through `./runTests.py`:
./runTests.py test/unit/math/gpu/opencl_context_test.cpp STAN_OPENCL=true OPENCL_DEVICE=0
------------------------------------------------------------
make -j1 OPENCL_DEVICE=0 test/unit/math/gpu/opencl_context_test STAN_OPENCL=true
clang++ -Wall -I . -isystem lib/eigen_3.3.3 -isystem lib/boost_1.65.1 -isystem lib/cvodes_2.9.0/include -isystem lib/idas-2.1.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -isystem lib/opencl_1.2.8 -DSTAN_OPENCL -DOPENCL_DEVICE=0 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -O3 -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -O3 -DNO_FPRINTF_OUTPUT -pipe -c -o test/unit/math/gpu/opencl_context_test.o test/unit/math/gpu/opencl_context_test.cpp
clang++ -Wall -I . -isystem lib/eigen_3.3.3 -isystem lib/boost_1.65.1 -isystem lib/cvodes_2.9.0/include -isystem lib/idas-2.1.0/include -std=c++1y -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -Wno-unused-function -Wno-uninitialized -isystem lib/opencl_1.2.8 -DSTAN_OPENCL -DOPENCL_DEVICE=0 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_HAS_PTHREAD=0 -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -O3 -DNO_FPRINTF_OUTPUT -pipe -lOpenCL -o test/unit/math/gpu/opencl_context_test test/unit/math/gpu/opencl_context_test.o lib/gtest_1.7.0/src/gtest_main.cc lib/gtest_1.7.0/src/gtest-all.o
------------------------------------------------------------
OPENCL_DEVICE=0 --gtest_output="xml:OPENCL_DEVICE=0.xml"
/bin/sh: 1: --gtest_output=xml:OPENCL_DEVICE=0.xml: not found
OPENCL_DEVICE=0 --gtest_output="xml:OPENCL_DEVICE=0.xml" failed
exit now (02/21/18 08:24:52 EST)
However, running `./test/unit/math/gpu/opencl_context_test` directly after `./runTests.py` shows the tests all passing. I think this is just something with how `./runTests.py` grabs the XML. We add those flags to `make/local` in Jenkins, so this is just an FYI for when you run it from the terminal.
@rok-cesnovar, has your team looked into using multiple devices? We could probably write a macro with the above to do something like how ViennaCL grabs multiple devices here.
I think I see what you've done. Should we just make that a make variable that's passed through `make/local` instead of passing it through the `./runTests.py` python script? I'm wary of tacking on additional complexity there (at the python level).
Yes, we also do that with `STAN_OPENCL`. The above is just an FYI if you compile locally.
That command you posted failed, no?
The test passes (at least on my local machine) once it compiles, but it can't find the XML, so `./runTests.py` fails.
If you run it with make it should compile and pass. On Jenkins we will pass `STAN_DEVICE` to `make/local`, so we don't do anything to `./runTests.py`.
That call with the XML file just looks wrong... I don't think runTests is designed to pass two flags.
Ah! Yes, that's probably it. Ignore the above; that's just a whoopsie on my part.
@syclik Are you okay with how the `opencl_context` is looking? If so I'll push this back into the original PR.
@SteveBronder, I've just reviewed the code again. Here's what we need to do:

- For the `opencl_context` constructor, we need to discuss. At some point, we said that the construction would be lightweight and that the initialization would be done on first use. This doesn't do that -- it's fine and it's a design decision. Let me know if that's intentional.
- For the `opencl_context` constructor, update documentation.
- For the `opencl_context` constructor, update tests.
- Remove the `getInstance()` method.
- Clean up the public members of the class. Is there any reason things like the typedefs `map_string`, `map_kernel`, or `map_bool` are in the public interface? Does `kernels_` need to be a public member? I don't see these things as being accessed outside of this class, so I'd make them private.

The overall goal is to get this into the math library. We're going through the exercise of turning prototype code into something that's acceptable to be distributed to all math users. That means all users of Stan, PyStan, CmdStan, and RStan.
The overall things we need to spend more time on are:

- implementation of the design. I think the overall design is right (modulo cleaning up a few things). We just need to implement it safely.
- helping future maintenance. One way to do that is a clean design (which we have). The second way is adding enough tests so we know how to instantiate it, debug it in the future, and prevent it from breaking as we move forward with the rest of the GPU features.
Which things on the list do you want to tackle? I can help with some of them too.
Thank you for the review!
- For the opencl_context constructor, we need to discuss. At some point, we said that the construction would be lightweight and that the initialization would be done on first use. This doesn't do that -- it's fine and it's a design decision. Let me know if that's intentional.
What are the cons of our current method? @rok-cesnovar, do you have a major preference for lazy initialization vs. what we have now (eager initialization for the `opencl_context` and lazy compilation of the kernels)?
I switched the initialization method because this seems to be the preferred way of setting up thread safe singletons in C++11. I based my approach off of the two links below:
https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems
https://stackoverflow.com/questions/34457432/c11-singleton-static-variable-is-thread-safe-why
If either of you think this is not the best or have a preferred style I'm all ears. My only concern with the current method is how this integrates with MPI.
- Remove the getInstance() method.
What would you prefer for initialization?
- Clean up the public members of the class. Is there any reason things like the typedefs map_string, map_kernel, or map_bool are in the public interface? Does kernels_ need to be a public member? I don't see these things as being accessed outside of this class, so I'd make it private.
I can move these to private.
The overall things we need to spend more time with this on is:
- implementation of the design. I think the overall design is right (modulo cleaning up a few things). We just need to implement it safely.
- helping future maintenance. One way to do that is a clean design (which we have). The second way to help this is by adding enough tests so we know how to instantiate it, debug it in the future, and prevent it from breaking as we move forward with the rest of the GPU features.
Yes I agree, I will write out tests for how to use (and break) all the public methods
Which things on the list do you want to tackle? I can help with some of them too.
For 1, if we need a full lazy initialization style to use GPUs across slaves with MPI, can you lay out what the design of that should look like? I can tackle 2:8, though I will be on a business trip till Thursday this week; if you have any spare time, feel free to tackle any of the above at your leisure.
First of all, thanks to both of you!
No, I do not have a major preference regarding the instant initialization. I think there is no argument against doing it instantly. For the kernel compilation, I think we should stay with the lazy initialization.
I will hopefully tackle 2:8 by Thursday, so just keep an eye on the changes when you get back.
I'm going to leave this as-is because I like how it looks. Is it not lightweight? You don't like the `getInstance()` function?
If you would like more docs or different information, let me know.
Do we need to remove this?
Cleaned this all up. I rewrote this so that instead of having a bunch of different maps we have two: one holding the kernel meta info, and the other just a map from kernel name -> compiled kernel.
Thanks! I've got it on my list of things to take care of soon. Thanks for making all the updates.
Just a quick note on `getInstance()`. There's a way to do it with a `getInstance()` method and there's a way to do it without. I think without is simpler, but I'll just outline the two different ways (the current implementation isn't one or the other):

Without `getInstance()`:

- There's a static instance: `static opencl_context opencl_context;`. When doing this, it's important to have the constructor be quick and not throw if possible. This is where it makes sense for the `opencl_context` to be in a state that's valid, but not fully initialized, with the initialization happening on first use.
- The `getInstance()` method is no longer needed at all. If you need the instance, you just grab the static instance `stan::math::opencl_context`.

With `getInstance()`:

- To get the `opencl_context`, the user does this at the point where they need it: `opencl_context opencl_context = stan::math::opencl_context::getInstance();`. Every time.
- Initialization happens when `getInstance()` is called.

So... I think we can just remove the `getInstance()` static method and we'll be fine.
Submission Checklist
./runTests.py test/unit
make cpplint
Summary:
Fixup / rewrite of OpenCL context
Intended Effect:
How to Verify:
Side Effects:
Documentation:
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: