bespoke-silicon-group / bsg_manycore

Tile-based architecture designed for computing efficiency, scalability, and generality

Multiple Tile Groups #665

Closed natewise closed 1 year ago

natewise commented 1 year ago

So I'm currently exploring CUDA, and I'm looking to run a CUDA program with multiple kernels (this is a bit of an addendum to some of the answers I received in issue #663). It was mentioned to me that kernels should be run on a tile group, so a multi-kernel program should have several tile groups so the kernels would have access to shared memory and so on. I haven't been able to find out how to make several tile groups within a single tile array, and looking into it, many of the headers don't appear to support this. bsg_set_tile_x_y.c seems to be set up to automatically use one tile group (lines 36-38).

Also, bsg_tile_group_barrier seems to only allow for one tile group, because the BSG_TILE_GROUP_X/Y_DIM variables are set once per include. So the bsg_row_barrier and bsg_col_barrier structs will be defined with a done array of length BSG_TILE_GROUP_X/Y_DIM. Is this correct? Any tips for getting multiple kernels and/or multiple tile groups set up, or maybe a CUDA example that already does this? Thanks!

mrutt92 commented 1 year ago

Hi.

I haven't been able to find out how to make several tile groups within a single tile array, and looking into it, many of the headers don't appear to support this. bsg_set_tile_x_y.c

The host sets the ID variables when using CUDA. The routine in this file, in which tiles set their own IDs, is unused for CUDA kernels.

Any tips for getting multiple kernels and/or multiple tile groups set up, or maybe a CUDA example that already does this? Thanks!

This is an example of launching one kernel with multiple tile groups. The key is the grid_dim argument to kernel_enqueue(): it launches one tile group per point in the grid, each with the dimensions given by tg_dim. https://github.com/bespoke-silicon-group/bsg_replicant/blob/main/examples/cuda/test_vec_add_parallel/main.c
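Condensed from that example, the host side looks roughly like the sketch below, using the hb_mc_*-prefixed forms of the calls mentioned above from the bsg_replicant CUDA-lite host API. The binary path, allocator name, kernel name, and arguments are placeholders, and error handling is abbreviated:

```c
#include <bsg_manycore_cuda.h>
#include <bsg_manycore_errno.h>

int run(void)
{
        hb_mc_device_t device;
        int rc;

        // Initialize the device and load the RISC-V binary holding the kernel
        // ("kernel.riscv" and the allocator name are placeholders).
        rc = hb_mc_device_init(&device, "multi_tg_example", 0);
        if (rc != HB_MC_SUCCESS) return rc;
        rc = hb_mc_device_program_init(&device, "kernel.riscv", "multi_tg_example", 0);
        if (rc != HB_MC_SUCCESS) return rc;

        // A 4x1 grid of 2x2 tile groups: four groups, each 2x2 tiles.
        hb_mc_dimension_t tg_dim   = { .x = 2, .y = 2 };
        hb_mc_dimension_t grid_dim = { .x = 4, .y = 1 };

        uint32_t cuda_argv[1] = { 0 }; /* kernel arguments, if any */

        // Enqueue one kernel; the runtime launches grid_dim.x * grid_dim.y
        // tile groups, each of dimension tg_dim.
        rc = hb_mc_kernel_enqueue(&device, grid_dim, tg_dim,
                                  "kernel_vec_add", 1, cuda_argv);
        if (rc != HB_MC_SUCCESS) return rc;

        // Block until every tile group in the grid has finished.
        rc = hb_mc_device_tile_groups_execute(&device);
        if (rc != HB_MC_SUCCESS) return rc;

        return hb_mc_device_finish(&device);
}
```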

Note that the barrier APIs in CUDA are tile group barriers only. The only API we currently support for syncing across groups is through the host with tile_groups_execute().
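For instance, a two-phase computation can use the execute() boundary as a host-side barrier across all tile groups. This fragment continues the sketch above; the phase kernel names are placeholders:

```c
// Phase 1: enqueue and run to completion; tile_groups_execute()
// returns only after every tile group in the grid has finished.
rc = hb_mc_kernel_enqueue(&device, grid_dim, tg_dim, "kernel_phase1", 1, cuda_argv);
rc = hb_mc_device_tile_groups_execute(&device);

// Phase 2 starts only after phase 1 is globally done, so the
// execute() boundary acts as a barrier across tile groups.
rc = hb_mc_kernel_enqueue(&device, grid_dim, tg_dim, "kernel_phase2", 1, cuda_argv);
rc = hb_mc_device_tile_groups_execute(&device);
```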

Any tips for getting multiple kernels and/or multiple tile groups set up, or maybe a CUDA example that already does this? Thanks!

We admittedly do not have a good example of launching multiple kernels at once, but this would be done by calling kernel_enqueue() multiple times before calling tile_groups_execute(). Note that there is no guarantee of ordering between kernel invocations.
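Roughly, continuing the same sketch (kernel names, grid sizes, and arguments are placeholders; error checks omitted):

```c
uint32_t argv_a[1] = { 0 };
uint32_t argv_b[1] = { 0 };
hb_mc_dimension_t grid_a = { .x = 2, .y = 1 };
hb_mc_dimension_t grid_b = { .x = 2, .y = 1 };

// Both kernels are enqueued before a single execute, so their tile
// groups may be dispatched in any order relative to each other.
rc = hb_mc_kernel_enqueue(&device, grid_a, tg_dim, "kernel_a", 1, argv_a);
rc = hb_mc_kernel_enqueue(&device, grid_b, tg_dim, "kernel_b", 1, argv_b);
rc = hb_mc_device_tile_groups_execute(&device); // runs both kernels' groups
```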

Also, bsg_tile_group_barrier seems to only allow for one tile group, because the BSG_TILE_GROUP_X/Y_DIM variables are set once per include. So the bsg_row_barrier and bsg_col_barrier structs will be defined with a done array of length BSG_TILE_GROUP_X/Y_DIM.

What I think you mean here is that it only supports one tile group size; you can have multiple tile groups of the same size using grid_dim and tg_dim, as mentioned above.

You can use multiple tile group sizes by compiling your kernels as separate objects and setting these macros to the desired values for each object.
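For example, each kernel source file can pin its own barrier dimensions before including the barrier header, so each object gets barrier structs sized to its own tile group. This is a sketch; the file name, kernel name, and dimensions are illustrative:

```c
/* kernel_2x2.c -- compiled as its own object with a 2x2 tile group */
#define BSG_TILE_GROUP_X_DIM 2
#define BSG_TILE_GROUP_Y_DIM 2
#include "bsg_manycore.h"
#include "bsg_tile_group_barrier.h"

// The done arrays in these barrier structs are sized by the macros above.
INIT_TILE_GROUP_BARRIER(r_barrier, c_barrier,
                        0, BSG_TILE_GROUP_X_DIM - 1,
                        0, BSG_TILE_GROUP_Y_DIM - 1);

int kernel_2x2(int *data)
{
        /* ... per-tile work ... */
        bsg_tile_group_barrier(&r_barrier, &c_barrier);
        return 0;
}
```

A second file, say kernel_4x4.c, would define the macros as 4 before its own includes, and the host would enqueue each kernel with the matching tg_dim.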

This is not directly related to your question, but there are two barrier alternatives, including a hardware-accelerated one: https://github.com/bespoke-silicon-group/bsg_manycore/blob/master/software/bsg_manycore_lib/bsg_cuda_lite_barrier.h