[DR] Twist-shift boundary conditions

manauref commented 5 months ago

[UPDATE]: see latest design in my comment on July 31st below.

These boundary conditions are meant for 3D gyrokinetics (though in fact they were also applied to fluids in other codes), and consist of

f(x, y, z_max) = f(x,y-S,z_min)

or viceversa

f(x, y, z_min) = f(x,y+S,z_max)

described, along with its DG algorithm, in M. Francisquez, et al. CPC 298 (2024) 109109 (see eqn 9).

As given in eqn 30 of that paper, the distribution in a the target cell (i,jtar) is given by a linear combination of mat-vec multiplies including contributions from N{do}(i) donor cells

f_{i,j_tar} = sum_{q=1}^{N_{do}(i)} M_q f_{do,i,j_q}

So implementing this BC consists of two things:

Constructing the M_q matrices (during initialization).
Performing these matrix vector multiplies, and adding their contributions.

We intend to do this building up the capabilities of zero/bc_twistshift.c, which was created last year just to do 2, assuming 1 was done in g2. We'll mention some initial design intentions about each of these below, and edit it as things evolve:

Constructing M_q matrices

We'll store a flat list of donor matrices, created with

  up->matsdo_ho = gkyl_nmat_new(total_donor_mats, basis->num_basis, basis->num_basis);

for which we'll need to know the total number of donors (matrices or cells, total_donor_mats). This is computed after having a (flat) list of the donor cell indices (cellsDo in g2, currently cells_do in g0). To be precise, cells_do, is a flattened 4D array: for each (i,j) location we store the x-y indices of N_do(i) donors.

After knowing who the donors are for each target, we proceed to populate their elements. This involves a sequence of sub-cell integrals, which we have to craft based on a number of 1D projections (onto a DG basis), root finding, and calling 1 of 3 kernels. In g2 this part of the code made heavy use of Lua tables to store scalars, arrays, functions and maybe even strings; we'll have to figure out how to write the C equivalent without such tables, though in some places we may replace similar those data objects with C structs.

The matrices are computed on the host, and are later copied to the GPU with

  gkyl_nmat_copy(up->matsdo, up->matsdo_ho);

Performing mat-vec multiplies and adding their contributions

Knowing who the donors are, and the map from target cell index to linear index into matsdo, we can easily perform these multiplications (i.e. w/ gkyl_nmat_mv) on the host, in serial, and add their contributions together.

On the GPU this is a little tricker. Part of the challenge is that we call gkyl_nmat_mv, which calls cuBLAS and this is not meant to be called in a kernel.

Our first attempt didn't parallelize over y, or vpar-mu & I think only parallelized over x and N_do(i) (but actually, I think the assignment of the vectors is not parallelized over x). This we found to be slow.
To expose greater parallelism we could:
- Parallelize over vpar and mu. Conceptually this should be easy, as the BC has not dependency on vpar,mu.
- Parallelize over y. This is trickier since the donors vary with y. Needless to say, the GPU parallelizes over basis functions by virtue of cuBLAS parallelizing each mat-vec multiply.

In the current approach the mat-vec multiplies are being done with

    gkyl_nmat_mv(1.0, 0.0, GKYL_NO_TRANS, up->matsdo, up->vecsdo, up->vecstar);

and then, the reduction and assignment into the solution vector, are calling gkyl_bc_twistshift_inc_cu in each x cell (so this step is not parallelized over x).

To expose greater parallelism we could do this (a private comment from last year):

i was at first reluctant to batch over vpar,mu also because we'd have to create too many matrices and vectors. This isn't necessarily true. Matrices only depend on x and donorIdx, so it's natural to batch over Nx and ndonor. But when batching over vpar,mu one can create an array of pointers to matrices, where there are only ndonor*Nx matrices allocated, but ndonor*Nx*Nvpar*Nmu pointers (i.e. the extra pointers in Nvpar*Nmu space simply point back to the same Nx*ndonor matrices, without having to allocate more matrices).
The vectors however do have to be allocated, because they need to hold the distinct value of f in those (donorIdx, xIdx, vparIdx, muIdx) cells. So the memory needed to batch over Nvpar*Nmu is only Nvpar*Nmu times greater than what we use now.
An estimate for a 128x96x16x32x12 grid with 15 donor cells at each x (48 DG coefficients):
Memory for vectors, batching over Nx*ndonor = 15*128*48*8 = 0.73 MB
Memory for vectors, batching over Nx*ndonor*Nvpar*Nmu = 15*128*32*12*48*8 = 283 MB
The latter is big, but it's not terrible. It's 1536 times smaller than a 5D distribution function. Furthermore, when we decompose in x and y these numbers will be smaller still.

manauref commented 5 months ago

Some potentially useful info: I noticed that the gkyl_nmat_mv, which calls cu_nmat_mm has

        cublasHandle_t cuh;
        cublasCreate_v2(&cuh);
...
  // Now do the strided batched multiply
  cublasStatus_t info;
        info = cublasDgemmStridedBatched(cuh, transa, transb, C->nr, C->nc, k, &alpha, A->data, lda, sza.nr*sza.nc,
                B->data, ldb, szb.nr*szb.nc, &beta, C->data, ldc, szc.nr*szc.nc, A->num);
  cublasDestroy(cuh);

and our hackathon mentors/cuBLAS devs said that handle creation is very expensive, and that they strongly recommended maintaining the same handle. We ought to explore this as a later-stage optimization.

akashukla commented 5 months ago

Initial inspection of the code reveals that much of the advance method can be done at t=0. The initial phase of prototyping has been focused on moving these things out of the advance method in order to make it clearer how we can parallelize the steps that actually need to happen in the advance method.

manauref commented 3 months ago

A new prototype of the TS BC has been implemented in the gk-g0-app_tsnew-prototyping branch.

Instead of using mat-vec multiplies as in eqn 30 of M. Francisquez, et al. CPC 298 (2024) 109109, we have instead used mat-mat multiplies to perform this operation (for 3x2v)

ftar_{i,j_tar,k,m,n} = (sum_{q=1}^{num_do_tot} ( scimat_q fmat_q )_{n,s}

where

num_do_tot is the total number of donors for a given y-vpar-mu index (i.e. the number of donors at all x added up).
ftar_{i,j_tar,k,m,n} is the n-th DG coefficient of the target distribution in the i,j_tar,k,m (x,y,vpar,mu) cell.
scimat is a num_basis x num_basis matrix with the appropriate subcell integral contribution.
fmat is a num_basis x (Ny Nvpar Nmu) matrix, with the DG coefficients of the donor distributions arranged as rows of the matrix and the various y-vpar-mu cells arranged as columns
the subscript s is used to indicate a linear index running over the x-y-vpar-mu dimensions.

Some comments on the construction of the scimat matrices (formerly M_q) and the mat-mat multiples:

Constructing `scimat` matrices

We find the donor cells using the same algorithm described in the paper and used in g2. This uses, amongst other things, the function gkyl_rect_grid_find_cell that @jRoeltgen implemented in g0 (modeled after the findCell method in g2's RectCart). One would think that gkyl_rect_grid_coord_idx, which is much simpler, would work; but the former better handles floating point precision issues around the periodic boundary. Perhaps this should be in the TS updater rather than in the grid object.

The scimat matrices however are now computed a little differently. We realized that figure 4 of the paper is not accurate because the spacing between the blue lines should always be Delta y. Using that fact, we simplified the calculation of these subcell integrals in two ways:

They now all use variable y limits, and no variable-x limit integrals are used.
They form some integrals out of two smaller, side-by-side (left-right portions) integrals. Unlike, for example what is described in figure B.19, we now accurately draw this scenario as follows (courtesy of @tnbernard) and we build this subcell integral out of the two halfs separated by the dashed grey line.

The first of these simplifies some integrals quite a bit, because variable x-limit integrals needed inversion of the y_{j_tar+/-1/2}-S(x) functions, which had more tunable parameters and room for less robustness. The second of these simply reduces the type of code/algorithms needed (it may not reduce the amount of code as the present code still has a fair bit of repetition). Despite these changes, I've been able to reproduce figures 6, 9 and 11, and by checking conservation at different resolutions i've also confirmed what's in figure 16a.

The matrices are computed on the host, and are later copied to the GPU with

  gkyl_nmat_copy(up->matsdo, up->matsdo_ho);

Performing mat-mat multiplies and adding their contributions

Having constructed the scimat matrices, the application of the BC consists of

Assigning the columns of the fmat matrices with entries from the donor distribution.
Calling gkyl_nmat_mm to perform num_do_tot mat-mat multiplications.
At the i-th cell along x, adding up the contribution of num_do(i) mat-mat multiplications, and placing that in the target distribution.

Step 1 uses an index map from the donor distribution to the columns of the fmat matrices (num_numcol_fidx_do). Step 2 uses a black-box N mat-mat multiplication function. And step 3 uses an index map from the result of a mat-mat multiplication, to the target distribution (num_numcol_fidx_tar).

Possible improvements

Presently the BC is applied using volume expansions. We could instead evaluate f at the surface and use surface expansions. This would reduce the number of rows in the matrices by 2X, and thus would reduce the number of FLOPS by at least 4X I think (mat-mat multiplication is an N^2.4 - N^3 operation).
I believe gkyl_nmat_mm is still creating a CUBLAS handle every time. We need to change it so it only creates it once.
We are allocating quite a bit of memory to create fmat, i.e. an num_do_totNyNvparNmuncomp chunk of memory, which is somewhere between a 4D and a 5D distribution. It might be advantageous to find a away to reduce this memory need.
The construction of scimat has a bit of repetition now that all scenarios are treated with variable y limits. We should look into writing this even more modularly with less repetition. But we may also not want to do this until we've done some simulations and know the present approach is robust.
If after having done some simulations we find that the current implementation is robust enough, we could remove the 2 unused kernels and simplify some of the kernel-related code.

tnbernard commented 3 months ago

I support finding a way to reduce the memory, since @manauref mentioned in the GK meeting today that high resolution LAPD simulations are approaching a memory limit. This will undoubtedly be exacerbated in shaped geometry and if neutrals/impurities are included.

Which 2 kernels are not used that you reference in improvement no. 5? Do you mean that at the sub-cell integral calculation, where 3 kernels were needed, now only one is necessary?

manauref commented 2 months ago

Reducing the memory footprint of TS BCs will not improve our ability to run bigger (clopen) sims on GPUs, because it's the difference between ~5 distributions or 6 (so only a 16.67% footprint). What we really need is parallelization in other dimensions, or a refactoring of many other parts of the code so we allocate fewer arrays.

In g2 there were 3 kernels:

subcell integrals with variable x limits
subcell integrals with variable y limits
integral over the whole cell we only use one of these now, so we could remove the other 2.

tnbernard commented 2 months ago

Thanks for clarifying.

ammarhakim commented 2 months ago

Closed as this work is complete. Nice job, @manauref!

ammarhakim / gkylzero