Closed shekarman1 closed 1 year ago
Could you maybe edit your post to put your code in a proper formatted block ```cpp <code here> ```
and with indentation? Or alternative remove the code completely and attach the .cpp file? That makes it a bit more readable.
Note that in a branch a while back I've made a multi-threading test but that was never merged (I forgot the reason why). Perhaps you can try it out? Here is the branch: https://github.com/CNugteren/CLBlast/compare/master...multithreading_test (I just merged the latest master in to make it up-to-date).
First, thank you for the prompt response.
Apologies for the unformatted code. I tried attaching .cpp but github wouldn't let me. I edited the OP with the correct formatting.
And, thank you for the branch. I will check it and keep you posted.
@CNugteren I tested with the branch as you recommended. I am seeing the same behavior as before -- still getting -2039 when I run the above test example in parallel. Sequentially, it works (as before).
Thanks for testing. But I'm not sure if you did what I actually meant: compile and run the `./clblast_test_multithreading` program which is available in that branch.
But other than that, I'll try to find some time to run your code example and see if I can reproduce the issue.
I tried to reproduce the issue but I can't. I think this is because I don't have a multi-GPU system myself to test on.
A few suggestions after looking at your code that you can still try (other than run the earlier mentioned multithreading test):
- [...] `clWaitForEvents(1, &event);`, as done in the CLBlast samples.
- [...] `clRelease*` functions in your code. When I ran it on my machine it also segfaulted at the very end. Perhaps you can try to clean-up your code: release all memory and don't use globals. In particular, you can try to use some OpenCL data-types perhaps within your thread, e.g. the streams can be made within a thread I suppose?

I hope that with these suggestions you are able to debug the issue further. Let me know if any of the above is unclear.
@CNugteren Thank you for a detailed response. The use case I am trying to implement is to run a multi-threaded multi-GPU program where each thread can do its own invocation of kernels and GEMM. In response to your suggestions:
One other thing you could do is compile CLBlast in verbose mode. That way messages will be printed, giving some indication of when the errors appear and what it was doing before that.
@CNugteren An update. The problem I am encountering is related to thread safety -- and not multi-GPU. I can run on multiple GPUs using a single context. As you suggested, I am building with VERBOSE to see if I get any more information.
@CNugteren When I run with a version of CLBlast built with VERBOSE=ON, the test case works. See attached for log. But if I run it with regular CLBlast (no VERBOSE) the test case fails with this error:
    ------------------------------ non VERBOSE -----------------
    Testing with M 100 N 100 K 100
    Platform profile: FULL_PROFILE
    Platform version: OpenCL 3.0
    Name: Intel(R) OpenCL Graphics
    Starting thread for GPU 0
    Starting thread for GPU 1
    CLBlast (unexpected): Internal logic error: Cache::Store: object already in cache
    43:0: OpenCL err -2039
    GPU 1 done
    GPU 0 done
OK thank you, that is useful feedback. It looks like the cache (to store compiled objects to avoid re-compilation) mutex here doesn't properly work with multi-threading. Probably if you enable verbose mode you are just lucky/unlucky to not encounter the bug. Most likely in your original program if you add a sleep/wait at the right time you can also make it work, but that is not a solution of course.
A small question. When you say:
> I can run on multiple GPUs using a single context
You mean you ran the sample code from your first post here, right? What did you modify exactly to make it work on a single GPU? Then I can also try to reproduce the issue, which makes debugging easier. Thanks!
I'll investigate further.
On your question of what I changed to make the sample code work for multiple GPUs:
I upgraded the OpenCL driver (I have Intel GPUs) and Intel acknowledged that they had a bug in the older version of OpenCL driver and the newer version has a fix for it. I can dig up the driver version numbers if it is of interest.
Also, I am now compiling for OpenCL 3.0 (before the driver upgrade, I had to specify 2.2 when I compiled my code). I am not sure this is really relevant, but it is one other thing I changed. And finally, I am now using the clang compiler (versus gcc), which also seems irrelevant.
Thank you for looking into this issue and your prompt responses.
I'm still not able to reproduce locally unfortunately, because I have one GPU and also in my single-GPU multi-threaded test my OpenCL driver (also Intel) crashes. However, I did look into the code a bit and it might be that I found the bug. Although the cache actions are locked by a mutex, it might still be that two threads check one after the other whether object X is in the cache, then thread 1 stores X in the cache, and later thread 2 also wants to store X in the cache, because neither of them could retrieve it from the cache at first. Actually that might be perfectly valid behaviour.
So I think the solution here might be simple: replace the `throw LogicError();` here with a simple `return;`.
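To make the suspected sequence concrete, here is a minimal sketch that replays it step by step in a single thread. `ToyCache`, `ReplayInterleaving` and the key `"sgemm_kernel"` are hypothetical names made up for illustration; this is not CLBlast's actual cache class, and the integer `42` only stands in for a compiled object. It assumes each cache operation takes the mutex on its own, so the lock is released between the lookup and the store.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a cache of compiled objects (not CLBlast's class).
class ToyCache {
 public:
  // Returns true and fills `value` if `key` is already cached.
  bool Get(const std::string& key, int* value) const {
    std::lock_guard<std::mutex> lock(mutex_);
    const auto it = entries_.find(key);
    if (it == entries_.end()) { return false; }
    *value = it->second;
    return true;
  }

  // Stores `value` under `key`. If the key is already present (another
  // caller stored it first), keep the existing entry and return quietly
  // instead of throwing.
  void Store(const std::string& key, int value) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (entries_.count(key) != 0) { return; }
    entries_.emplace(key, value);
  }

  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return entries_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, int> entries_;
};

// Replays the interleaving: both "threads" miss on lookup, then both store.
std::size_t ReplayInterleaving() {
  ToyCache cache;
  int found = 0;
  const bool thread1_hit = cache.Get("sgemm_kernel", &found);  // miss
  const bool thread2_hit = cache.Get("sgemm_kernel", &found);  // miss
  assert(!thread1_hit && !thread2_hit);
  if (!thread1_hit) { cache.Store("sgemm_kernel", 42); }  // first store wins
  if (!thread2_hit) { cache.Store("sgemm_kernel", 42); }  // duplicate, ignored
  return cache.Size();
}
```

With the early `return`, `ReplayInterleaving()` ends with exactly one cached entry. If `Store` threw on a duplicate key instead, the second store would fail even though nothing went wrong from the caller's point of view.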
Could you try that for me? Thanks! And do make sure speed is still OK for the second and subsequent runs of a kernel, because the cache is meant to store compiled objects that can be re-used the next time.
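One way to check the speed of the first versus later runs is a small timing helper like the sketch below. `TimeRuns` is a hypothetical helper, not part of CLBlast; it assumes `work` is a callable you supply that issues one GEMM call and blocks until it has finished (for example by ending with `clWaitForEvents(1, &event);`).

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Calls `work` `runs` times and returns the wall-clock duration of each call
// in milliseconds. `work` must block until the operation has completed,
// otherwise only the enqueue time is measured.
std::vector<double> TimeRuns(const std::function<void()>& work, int runs) {
  std::vector<double> millis;
  if (runs > 0) { millis.reserve(static_cast<std::size_t>(runs)); }
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    millis.push_back(
        std::chrono::duration<double, std::milli>(stop - start).count());
  }
  return millis;
}
```

If the cache is doing its job, the first entry should include the compilation time and the later entries should be noticeably smaller.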
Good catch on the issue -- thanks. Yes, I will definitely try this and report back. It will take me a day or so to try this -- I have a long running test on the machine that has 2 Intel GPUs and it is not "free." Likely tomorrow.
@CNugteren Your suggestion to inhibit throwing the exception/error and just returning works! I am testing now. If you like, I can check the fix into a new branch and push it.
Great!
> If you like, I can check the fix into a new branch and push it.
As you want. If you prefer I can also do it, no worries. If you do it, do not forget to add a small note in the `CHANGELOG` file about the fix.
Go ahead please. Thanks for all your help.
Hi,
Here is a test example that illustrates the problem. I am running Windows 10 and have two Intel GPUs (ARC A770) using Intel's OpenCL implementation and CLBlast (version 1.6.0). Test code given below.
M, N, K = 100. Compute Sgemm 200 times.
The test code implements two scenarios:
CLBlast (unexpected): bad allocation with error code -2039
Command lines are:
Any ideas on what I am doing wrong here? Thanks in advance.