CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.
Other
157 stars 27 forks source link

Fix hipBLAS sync #838

Closed pvelesko closed 1 month ago

pvelesko commented 2 months ago

Fixes #836

pvelesko commented 2 months ago

I am not able to understand the reason we need synchronous queue. May I know the issue you are addressing with this change? With limited description in https://github.com/CHIP-SPV/chipStar/issues/836 it is hard to understand the actual issue.

In hipBLAS, you explicitly create an in_order queue from the native handles:

            if (isImmCmdList) {
                ctxt->queue = sycl::ext::oneapi::level_zero::make_queue(ctxt->context, ctxt->device, (pi_native_handle)hCommandList, true, 1, sycl::property::queue::in_order());
            } else {
                ctxt->queue = sycl::ext::oneapi::level_zero::make_queue(ctxt->context, ctxt->device, (pi_native_handle)hQueue, false, 1, sycl::property::queue::in_order());
            }

This translates to SYNCHRONOUS queue in Level Zero - you can't mix them. So that forces us to use SYNCHRONOUS queue on our end.

Since we use ASYNCHRONOUS queues in Level Zero backend, that means we have to create a separate queue just for interop.

If you were to switch to out of order queues, we would need to manage the ordering of operations ourselves using events. That means, you would need to implement event ordering in hipBLAS, and we would require to get a handle to the last event returned from hipBLAS so that we can use it for sync.

I think using a separate sync queue for Level Zero is the best solution right now due to the complexities involved in out of order.

Remember making chipStar queue synchronous may not be very helpful.

If you were to implement out of order queues in hipBLAS, we can support that.

Sarbojit2019 commented 2 months ago

@pvelesko, Thanks for clarification.

Some thoughts ....

Anyway, my take is any synchronization issue should be handle inside hipBLAS or any client library/app. chipStar runtime need not to handle it. Like in this case if you see 'asum' is failing then hipBLAS should handle it.

pvelesko commented 2 months ago

I think having separate API to get synchronous queue would be more futuristic than assuming all clients would need synchronous queue.

Hmm sure we can add that.

Ideally hipBLAS need not to submit anything via chipStar queue hence I did not create it as in-order. I know there are some WAs, few I have disabled if not all.

My understanding was that the internal MKL command list is initialized with the handles provided - that's why the handles are needed in the first place. If MKL is not using the command queue/command list provided, then why are we passing in the handles? Context should be enough.

How are we going to handle non-immediate command list cases or OpenCL cases?

Can you elaborate?

I think with current approach applications will have performance impact as chipStar is going to add extra barrier with the submission even when call is not coming from the library.

That's true but I can't think of anything else that can be done in chipStar side.

Anyway, my take is any synchronization issue should be handle inside hipBLAS or any client library/app. chipStar runtime need not to handle it. Like in this case if you see 'asum' is failing then hipBLAS should handle it.

Yes, I think this should be handled in hipBLAS by adding a barrier after each call but until that work is done, chipStar needs to handle the current hipBLAS as it stands which this PR is intended to do.