Ability to compile an `export proc` containing GPU code into a C-callable library via `--library`

bradcray commented 6 hours ago

In a few recent conversations, potential end users have described use cases in which they would like to write a GPU kernel in Chapel as a single-locale procedure (e.g., one containing a GPU `on` block), export that procedure, compile it into a `.so` or `.a` library, and call it from a C program. This issue asks for the ability to do such a thing, assuming it does not work today, or, if it does work, for a test/example that demonstrates it and locks it in.

bradcray commented 6 hours ago

Sorry that I'm not filing this with better first-hand knowledge of what the status is. Asking @jabraham17 about https://github.com/chapel-lang/chapel/pull/25538 made it sound like there were other problems beyond that fix… I'm just not sure what they are and haven't had the chance to try it yet.

I didn't tag this with "user issue" because these are only potential user use cases of Chapel, not coming from existing users (and with no guarantee that they would use Chapel if this were supported). That said, it feels like a common thing to want for people who would like to program GPUs more easily from within existing code, and therefore attractive to support.

e-kayrakli commented 5 hours ago

https://github.com/chapel-lang/chapel/issues/25224 is the main issue blocking this as far as I know.

Using `sync begin on` instead of `on` is the workaround for the time being.
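
For readers landing here, a minimal sketch of what that workaround might look like inside an exported procedure. The name `sumOfSquares` and its body are hypothetical; this is not the code from the linked issue or from any test in the repository, and it assumes a GPU-enabled Chapel build with the file compiled using `--library`.

```chapel
// Hypothetical example: the procedure name and body are illustrative only.
export proc sumOfSquares(n: int): int {
  // Host-side buffer that receives the values computed on the GPU.
  var hostA: [0..#n] int;

  // `begin` creates a new Chapel task, `on` moves that task to the first
  // GPU sublocale, and `sync` makes the caller wait until the task is done.
  sync begin with (ref hostA) on here.gpus[0] {
    var devA: [0..#n] int;       // allocated while running on the GPU sublocale
    foreach i in devA.domain do  // order-independent loop, eligible to run as a GPU kernel
      devA[i] = i * i;
    hostA = devA;                // copy the results back to the host buffer
  }

  // Reduce on the host to keep the GPU portion as simple as possible.
  return + reduce hostA;
}
```

The explicit `with (ref hostA)` is there because the new task needs to write to a variable owned by the calling code; without `begin`, a plain `on here.gpus[0] { ... }` block would have the same body.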

bradcray commented 5 hours ago

Thanks for pointing back to that issue, which I'd forgotten about. If that workaround permits it to be done, I think that'd be acceptable (and worth locking in with a GPU test, unless that's not possible for some reason). It's a little regrettable in terms of code cleanliness but better than not being able to do it at all.

jabraham17 commented 5 hours ago

There is a test locking it in, `test/gpu/native/interop/gpuLibrary/udf2.chpl`, which is just a simple version of the code from https://github.com/chapel-lang/chapel/issues/25224.

bradcray commented 5 hours ago

Oh, awesome, thank you! I hadn't noticed that in that PR. I'm going to close this as a result since this checks the box that I was looking for and https://github.com/chapel-lang/chapel/issues/25224 covers the clean-up.

e-kayrakli commented 4 hours ago

> It's a little regrettable in terms of code cleanliness but better than not being able to do it at all.

Each time I think about this issue, I come to different conclusions. Today, I disagree that it is regrettable. But please check again tomorrow :)

Here is why:

If I am calling Chapel from some other language or application, I own my threads and processes, not Chapel. So, fundamentally, Chapel shouldn't be able to modify them. In this case, `on here.gpus[0]` would be modifying the task, as it would move it to a GPU sublocale. I believe this is the reason why our current implementation is lacking today. Or at least, this is the thinking that led to the current implementation. With `begin`, you are asking to create a new Chapel task that Chapel owns. As such, you can move it in Chapel. Whether an `on` statement modifies the task/thread is up for debate, of course.

From a completely practical standpoint, I do find it unfortunate that simple `on`s don't work in `--library` mode, as that means I can't write code that can be used both as a library and as a standalone application.

bradcray commented 4 hours ago

At the times you don't find it regrettable, do you also not consider the use of `begin` to be a workaround?

My first thought upon re-reading that issue was "Maybe all export routines should be run within a Chapel Qthread rather than the user's original thread (creating a new one if it isn't already in one)?" That seems like it would give it full Chapel capabilities, and it doesn't seem inappropriate (it is a Chapel computation, after all).

The main downside I can think of is that if I implemented a very trivial sequential routine in Chapel (say, a new, faster `sin()` implementation), would it introduce too much overhead? But maybe that's a case we should treat as an optimization: eliminating the need for a Chapel Qthread when the export routine is sufficiently simple?

Would there be other downsides to that approach? Are there things I can do in Chapel that I would prefer to execute in a C program's thread rather than a Qthread?

e-kayrakli commented 4 hours ago

> At the times you don't find it regrettable, do you also not consider the use of begin to be a workaround?

I do not. `sync begin on` reads like the natural thing to write ("Create a new Chapel task, move it somewhere else. I'll be waiting for you when you are done").

> Would there be other downsides to that approach? Are there things I can do in Chapel that I would prefer to execute in a C program's thread rather than a Qthread?

Certainly interesting to consider. Most HPC libraries have an initialization function. I believe we have something like that as well. Couldn't that function create a Chapel task (or "shepherd" if that's the right Qthreads concept) and put it to sleep until the first Chapel function is called? That might alleviate some of the worries you have, though it would still add some overhead for very simple operations.

That being said, if you are using Chapel from another application, one would think that the Chapel code contains parallel constructs that are heavier-weight than waking up a sleeping Chapel task.

Ability to compile an `export proc` containing GPU code into a C-callable library via `--library` #25975