CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

Test hangs indefinitely. #169

Closed Sarbojit2019 closed 2 years ago

Sarbojit2019 commented 2 years ago

I have written a small test that uses two streams to submit two independent kernels. When I run the test as-is, it hangs every time. I don't see any HIP API getting stuck, but somehow the process never completes and I have to kill it every time. Interestingly, when I comment out the "CHECK_STATUS(hipStreamSynchronize(stream1));" line, the test passes.

My module list:

sarbojit@s001-n002:~/src/my_tests> module list
Currently Loaded Modulefiles:
 1) spack/linux-opensuse_leap15-x86_64(default)
 2) gcc/10.2.0-gcc-10.2.0-yudlyez
 3) .igc/release/20220711.1
 4) intel_gpu_env/implicit_scaling
 5) intel_gpu_env/multi_gpu
 6) intel_gpu_env/platform_default
 7) intel_compute_runtime/release/agama-prerelease-519
 8) cmake/3.23.2-gcc-10.2.0-luxkr43
 9) boost/1.74.0-gcc-10.2.0-lwc57tn
10) source-highlight/3.1.9-gcc-10.2.0-q7luky6
11) gdb/9.2-gcc-10.2.0-usixmvz

#include <iostream>
#include <hip/hip_runtime.h>

#define CHECK_STATUS(cmd)                                                      \
  {                                                                            \
    hipError_t error = cmd;                                                    \
    if (error != hipSuccess) {                                                 \
      fprintf(stderr, "error: '%s'(%d) at %s:%d\n", hipGetErrorName(error),    \
              error, __FILE__, __LINE__);                                      \
      exit(1);                                                                 \
    }                                                                          \
  }

__global__ void addOne(int *__restrict A) {
  const uint i = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
  A[i] = A[i] + 1;
}

 // -----------> Validate stream semantics <-----------------------
 /* 1. Streams work independently
       i>   Create two streams
       ii>  Load tasks into first & second streams
       iii> add stream sync on first stream and wait for it to complete
       iv>
  */
bool test_streams_independent_exe() {
    // Create streams
    hipStream_t stream1, stream2;
    CHECK_STATUS(hipStreamCreate(&stream1));
    CHECK_STATUS(hipStreamCreate(&stream2));

    // Push work to stream1 & stream2
    int *pDev_4_strm1 = nullptr;
    CHECK_STATUS(hipMalloc(&pDev_4_strm1, sizeof(int)));

    int mem_sz = 1;
    int *pDev_4_strm2 = nullptr;
    CHECK_STATUS(hipMalloc(&pDev_4_strm2, mem_sz*sizeof(int)));

    hipLaunchKernelGGL(addOne, 1, 1, 0, stream1, pDev_4_strm1);
    hipLaunchKernelGGL(addOne, 1, 1, 0, stream2, pDev_4_strm2);

    CHECK_STATUS(hipStreamSynchronize(stream1)); // <- Commenting out this line allows the test to complete
    hipError_t status = hipStreamQuery(stream2);
    printf("status %s\n", hipGetErrorName(status));

    return true;
}

int main() {
    printf("Test result : %d\n",test_streams_independent_exe());
    hipDeviceSynchronize();
    printf("Passed\n");
    return 0; //<- always stuck here
}
Sarbojit2019 commented 2 years ago

The call stack shows that one of the runtime threads never exits, hence the hang.

Thread 4 (Thread 0x7fffe7f27700 (LWP 72400)):
#0  0x00007ffff6a13cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007ffff6a199c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007ffff6a46144 in usleep () from /lib64/libc.so.6
#3  0x00007ffff7b86972 in CHIPCallbackEventMonitorLevel0::monitor (this=0x1083290) at /home/sarbojit/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:510
#4  0x00007ffff7b7b047 in CHIPEventMonitor::monitorWrapper (Arg=0x0) at /home/sarbojit/src/chip-spv/src/CHIPBackend.hh:319
#5  0x00007ffff671f6ea in start_thread () from /lib64/libpthread.so.0
#6  0x00007ffff6a4fa6f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fffe8b28700 (LWP 72399)): // <- This thread never completes, even after device synchronize
#0  0x00007ffff6a13cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007ffff6a199c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007ffff6a46144 in usleep () from /lib64/libc.so.6
#3  0x00007ffff7b86cfa in CHIPStaleEventMonitorLevel0::monitor (this=0x10830d0) at /home/sarbojit/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:561
#4  0x00007ffff7b7b047 in CHIPEventMonitor::monitorWrapper (Arg=0x0) at /home/sarbojit/src/chip-spv/src/CHIPBackend.hh:319
#5  0x00007ffff671f6ea in start_thread () from /lib64/libpthread.so.0
#6  0x00007ffff6a4fa6f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fffe9729700 (LWP 72398)):
#0  0x00007ffff6a13cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007ffff6a199c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007ffff5a0c8a5 in ?? () from /soft/restricted/CNDA/emb/libraries/intel-level-zero/compute-runtime/20220711.1/lib64/libze_intel_gpu.so.1
#3  0x00007ffff5a0c910 in ?? () from /soft/restricted/CNDA/emb/libraries/intel-level-zero/compute-runtime/20220711.1/lib64/libze_intel_gpu.so.1
#4  0x00007ffff671f6ea in start_thread () from /lib64/libpthread.so.0
#5  0x00007ffff6a4fa6f in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7fdb2c0 (LWP 72377)):
#0  hipDeviceSynchronize () at /home/sarbojit/src/chip-spv/src/CHIPBindings.cc:785
#1  0x000000000040115f in main ()

(gdb) c
Continuing.
Passed
[Thread 0x7fffe7f27700 (LWP 72400) exited]

pvelesko commented 2 years ago

Could you please make a PR for this test?

pvelesko commented 2 years ago

Fix:

diff --git a/src/CHIPBackend.hh b/src/CHIPBackend.hh
index 7ef8016e..5a332533 100644
--- a/src/CHIPBackend.hh
+++ b/src/CHIPBackend.hh
@@ -2027,7 +2027,8 @@ public:
     if (!LastEvent_)
       return true;

-    LastEvent_->updateFinishStatus(false);
+    if(LastEvent_->updateFinishStatus(false))
+      LastEvent_->decreaseRefCount("Event became ready");
     if (LastEvent_->isFinished())
       return true;
pvelesko commented 2 years ago

Also, is this test still a WIP? It's not actually testing stream semantics. You request a sync on stream1, which is a blocking stream, so it will sync with the default stream (which is empty, since nothing was ever submitted to it), and then the second kernel launch will do the same.

Sarbojit2019 commented 2 years ago

This is a WIP test.

The intent of the test is to verify that two independent tasks can run on two different streams. To prove that, I planned to put a small task on stream1 and a longer-running task on stream2; querying stream2's status should then show hipErrorNotReady, since synchronization is done on stream1 only.

As part of the stream semantics test I want to cover a few more scenarios. I will submit the complete test as a PR once it is done.

pvelesko commented 2 years ago

put small task on stream1 and longer running task on stream2

@pjaaskel is not gonna be happy about this... 🥲

As part of stream semantics test I want to cover few more scenarios

Could you do the stream sync test next, since we have an open ticket on that issue? It would be nice to have a definitive test ASAP.

Sarbojit2019 commented 2 years ago

Well, there is no clock() or clock64() implementation for CHIP-SPV, so there is no way to block a kernel for a definite time interval; because of that we have to rely on these long/short tasks.

Let me try to get it ready ASAP; this hang blocked me today.

pjaaskel commented 2 years ago

@pjaaskel is not gonna be happy about this... 🥲

Gentlemen... please spend a bit of time and try to understand my message. The motivation should not be "my happiness", but test suite robustness:

We are working on a portable implementation where tasks submitted to streams might or might not run in parallel, and might take more or less time depending on where we run the app, the utilization by other workloads, etc. If we rely on timing, it will be flaky.

Of course I'm not against making things run as efficiently as possible in parallel; I just want to make sure there are no tests that pass on platform A but not on platform B, even though platform B's execution order adheres to the CUDA spec wording (which is all that programmers should look at, too).

pvelesko commented 2 years ago

I was kidding, Pekka; I understand your concern. It has also become evident to me that we are interpreting the API spec differently.

streams might or might not run in parallel

You keep coming back to the word "may". From my understanding, things MAY execute in parallel if there is enough system parallelism available. So if we are running a unit test where the only work is the two streams we just created, it MUST execute in parallel. I can't think of a HW platform that would be physically unable to execute two addOne operations in parallel; every platform that we could reasonably target is multi-core. Of course, I understand that you have much more experience with portable computation. Perhaps an example is all it would take to convince me otherwise?

utilization by other workloads

Running other workloads while unit testing a programming framework is not something I would ever consider doing - that's just asking for trouble. That's why I'm comfortable assuming the system has enough availability to run at least 2-3 streams in parallel at any time.

If we rely on timing, it will be flaky.

I agree that relying on kernels that run for a "long time" versus "not that long" is flaky. If we had a sleep we could call inside the kernel, that would not be flaky at all. Again, I'm open to suggestions on alternative approaches (it requires us to agree on the spec interpretation first, however).

Of course I'm not against making things run as efficiently as possible in parallel

This is not about efficiency or performance. This is about correctness and adhering to the specification. We must prove that things indeed MAY run in parallel on independent streams.

pass on a platform A, but not on platform B even though platform B's execution order adheres to the CUDA spec

Could you provide an example? Perhaps all this confusion comes down to my lack of understanding of the different execution models across the wide variety of HW out there.

pjaaskel commented 2 years ago

In Finland we have a relatively new saying, "läpällä ja kännissä" (meaning: "I was only joking and I was drunk when I said it"), which can be officially used to cancel whatever was said before. (I'm joking here now myself ;-) )

Yes, I think I now understand the confusion here. It comes from the fact that I've stared at specification documents which actually define these terms in lawyer talk before defining the APIs, and I see it this way: "may" in specifications typically means "it is allowed to do something" (but not forced, i.e., not "must"). Yes, in practice, if you have parallel cores it will "likely" execute in parallel, but not "must" in specification terms ("must" means "has to"). It sounds like nitpicking, but it's an important difference when implementing a spec.

A simple example where things might fail to execute in parallel is an overloaded processor, where the cores get competition from other apps and the host thread doesn't proceed fast enough to the point where you assert. Of course, the likelihood of this can be made infinitely small by making the background thread's execution time longer and by not allowing anything else to run on the PC while our test suite runs, but the possibility is still there.

Another example is a driver (in this case OpenCL or Level Zero) where concurrent execution of commands is not implemented, for one reason or another, and which thus essentially does what my initial brute-force workaround did: execute all commands in order, as if they were all blocking.

One more example is the amount of parallel resources and the support for concurrent kernel execution: if you write a program that timing-assumes, say, 4 compute units/cores executing kernels concurrently, and you run it on a platform that doesn't support concurrent kernel execution or simply has 2 cores, the outcome can be anything. Please bear in mind that we are targeting whatever OpenCL can support, not only CUDA targets - there's a large variety of devices there.

This is not about efficiency or performance. This is about correctness and adhering to the specification.

If the spec says "you may", then even if we won't, we are still functionally adhering to the spec in terms of correctness.

We must prove that things indeed MAY run in parallel on independent streams.

I totally agree with this part. We should also test and ensure that we are able to utilize the parallelism defined by the async streams in the program, to make sure we don't synchronize needlessly in the runtime - but that's already about performance optimization. Putting it in the suite of tests that are supposed to pass always, no matter what device and driver we are targeting, I'm against.

Also, the ticket's original test case failure seems suspicious to me. Why does a deadlock happen here?

pvelesko commented 2 years ago

"läpällä ja kännissä"

I need to learn how to pronounce this, and then I will incorporate it into my vernacular. It would also make a great commit message for reverting commits.

Of course, the likelihood of this can be made infinitely small by making the background thread's execution time longer and by not allowing anything else to run on the PC while our test suite runs, but the possibility is still there.

Right, I was operating in this infinitely small realm.

Another example is a (in this case OpenCL or LZ) driver where there is no concurrent execution of commands implemented for a reason or another

If my parallel programming language's driver can't run things in parallel, then I wouldn't expect to get a 100% pass rate on the unit tests.

there's a large variety of devices there

Yes but we need to put some limits on what kind of devices we reasonably expect to target, no? This is not DOOM haha

Just putting it in the suite of tests that are supposed to pass always, no matter what device and driver we are targeting, I'm against.

I can agree with this part as well. We can have a test target that must pass on everything and another parallel_tests target where things must pass as long as you're using a major vendor GPU or something?

All in all, I think we've finally cleared this up!

Also, the ticket's original test case failure seems suspicious to me. Why does a deadlock happen here?

Because upon exit, Level Zero expects all events to get collected and will wait until this happens. If an event does not get collected, it means there is a problem with the reference counting.

There was a bug in the reference counting; I fixed it and the program exited normally. Previously I would print warnings, but I wanted things to fail in such an annoying way that any issue with reference counting would get reported. Hangs seem to work well for this.

pvelesko commented 2 years ago

Fixed in Release-0.9

pjaaskel commented 2 years ago

"läpällä ja kännissä"

I need to learn how to pronounce this, and then I will incorporate it into my vernacular. It would also make a great commit message for reverting commits.

:) I can teach you it in our next telco.

there's a large variety of devices there

Yes but we need to put some limits on what kind of devices we reasonably expect to target, no? This is not DOOM haha

I wouldn't put limits in place too easily. If the goal is added portability, we should aim for whatever OpenCL can support. CUDA/HIP is also used in mobile/low-end devices (Tegra etc.).

Just putting it in the suite of tests that are supposed to pass always, no matter what device and driver we are targeting, I'm against.

I can agree with this part as well. We can have a test target that must pass on everything and another parallel_tests target where things must pass as long as you're using a major vendor GPU or something?

Sounds good to me!

All in all, I think we've finally cleared this up!

:+1:

Also, the ticket's original test case failure seems suspicious to me. Why does a deadlock happen here?

Because upon exit, Level Zero expects all events to get collected and will wait until this happens. If an event does not get collected, it means there is a problem with the reference counting.

There was a bug in the reference counting; I fixed it and the program exited normally. Previously I would print warnings, but I wanted things to fail in such an annoying way that any issue with reference counting would get reported. Hangs seem to work well for this.

OK. I created a milestone for 0.9.1, which we can release soon after tagging 0.9 later this afternoon; perhaps we can move this one there as well? We've made some promises in milestone reports etc., so let's just push 0.9 out - @franz will manage it until the end.

pjaaskel commented 2 years ago

(Ah sorry, missed that you pushed it already to Release-0.9)