Some potentially useful info:
I noticed that gkyl_nmat_mv, which calls cu_nmat_mm, has
cublasHandle_t cuh;
cublasCreate_v2(&cuh);
...
// Now do the strided batched multiply
cublasStatus_t info;
info = cublasDgemmStridedBatched(cuh, transa, transb, C->nr, C->nc, k, &alpha, A->data, lda, sza.nr*sza.nc,
B->data, ldb, szb.nr*szb.nc, &beta, C->data, ldc, szc.nr*szc.nc, A->num);
cublasDestroy(cuh);
and our hackathon mentors/cuBLAS devs said that handle creation is very expensive, and that they strongly recommended maintaining the same handle. We ought to explore this as a later-stage optimization.
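A minimal sketch of the kind of change the mentors suggested, assuming we keep a single handle inside the CUDA implementation; the guard, the helper name, and where the handle lives are illustrative, not the actual g0 code (a handle stored per updater or per stream would be the more robust option, since this simple version is not thread-safe):

#include <cublas_v2.h>
#include <stdbool.h>

// Illustrative only: create the cuBLAS handle once and reuse it for every
// batched multiply, instead of cublasCreate/cublasDestroy on every call.
static cublasHandle_t ts_cuh;
static bool ts_cuh_created = false;

static cublasHandle_t
get_cublas_handle(void)
{
  if (!ts_cuh_created) {
    cublasCreate_v2(&ts_cuh);
    ts_cuh_created = true;
  }
  return ts_cuh;
}

// In cu_nmat_mm one would then do
//   cublasHandle_t cuh = get_cublas_handle();
//   info = cublasDgemmStridedBatched(cuh, ...);
// and drop the per-call cublasDestroy, destroying the handle once at
// shutdown (or never) instead.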
Initial inspection of the code reveals that much of the advance method can be done at t=0. The initial phase of prototyping has been focused on moving these things out of the advance method in order to make it clearer how we can parallelize the steps that actually need to happen in the advance method.
A new prototype of the TS BC has been implemented in the gk-g0-app_tsnew-prototyping branch.
Instead of using mat-vec multiplies as in eqn 30 of M. Francisquez, et al. CPC 298 (2024) 109109, we now use mat-mat multiplies to perform this operation (for 3x2v):

ftar_{i,j_tar,k,m,n} = ( sum_{q=1}^{num_do_tot} scimat_q fmat_q )_{n,s}

where
- num_do_tot is the total number of donors for a given y-vpar-mu index (i.e. the number of donors at all x added up).
- ftar_{i,j_tar,k,m,n} is the n-th DG coefficient of the target distribution in the i,j_tar,k,m (x,y,vpar,mu) cell.
- scimat is a num_basis x num_basis matrix with the appropriate subcell integral contribution.
- fmat is a num_basis x (Ny Nvpar Nmu) matrix, with the DG coefficients of the donor distributions arranged as rows of the matrix and the various y-vpar-mu cells arranged as columns.
- s is used to indicate a linear index running over the x-y-vpar-mu dimensions.

Some comments on the construction of the scimat matrices (formerly M_q) and the mat-mat multiplies:
scimat matrices

We find the donor cells using the same algorithm described in the paper and used in g2. This uses, amongst other things, the function gkyl_rect_grid_find_cell that @jRoeltgen implemented in g0 (modeled after the findCell method in g2's RectCart). One would think that gkyl_rect_grid_coord_idx, which is much simpler, would work; but the former better handles floating point precision issues around the periodic boundary. Perhaps this should be in the TS updater rather than in the grid object.
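To illustrate the floating-point issue (a generic sketch, not the actual gkyl_rect_grid_find_cell or gkyl_rect_grid_coord_idx implementation, and the tolerance is illustrative): a naive coordinate-to-index map can put a shifted coordinate y - S(x) that lands on, or within roundoff of, a cell boundary into the wrong cell, so one needs periodic wrapping plus an explicit convention with a tolerance.

#include <math.h>

// Naive map: susceptible to roundoff when y - S(x) sits on a cell boundary.
static int
naive_coord_to_idx(double y, double lower, double dy)
{
  return (int) floor((y - lower)/dy);  // 0-based cell index
}

// Sketch of a more careful version: wrap periodically, then snap to the
// nearest cell boundary when within a small tolerance of it. Which cell a
// boundary point belongs to is a convention; the point is that the choice
// must be made explicitly.
static int
robust_coord_to_idx(double y, double lower, double upper, double dy, int ncells)
{
  double Ly = upper - lower;
  double r = fmod(y - lower, Ly);   // wrap into [0, Ly)
  if (r < 0.0) r += Ly;
  double s = r/dy;
  double snapped = round(s);
  if (fabs(s - snapped) < 1e-12) s = snapped;  // snap to boundary
  int idx = (int) floor(s);
  if (idx >= ncells) idx -= ncells;  // r can equal Ly within roundoff
  return idx;
}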
The scimat matrices, however, are now computed a little differently. We realized that figure 4 of the paper is not accurate because the spacing between the blue lines should always be Delta y. Using that fact, we simplified the calculation of these subcell integrals in two ways:

The first of these simplifies some integrals quite a bit, because the variable x-limit integrals needed inversion of the y_{j_tar+/-1/2}-S(x) functions, which had more tunable parameters and left more room for non-robust behavior. The second simply reduces the kinds of code/algorithms needed (it may not reduce the amount of code, as the present code still has a fair bit of repetition). Despite these changes, I've been able to reproduce figures 6, 9 and 11, and by checking conservation at different resolutions I've also confirmed what's in figure 16a.
The matrices are computed on the host, and are later copied to the GPU with
gkyl_nmat_copy(up->matsdo, up->matsdo_ho);
Having constructed the scimat matrices, the application of the BC consists of:
1. Filling the fmat matrices with entries from the donor distribution.
2. Using gkyl_nmat_mm to perform num_do_tot mat-mat multiplications.
3. Adding the results of the num_do(i) mat-mat multiplications, and placing that in the target distribution.

Step 1 uses an index map from the donor distribution to the columns of the fmat matrices (num_numcol_fidx_do). Step 2 uses a black-box N mat-mat multiplication function. And step 3 uses an index map from the result of a mat-mat multiplication to the target distribution (num_numcol_fidx_tar).
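For reference, here is a minimal serial sketch of the three steps above. All names (nbasis, ncol, ndo, fidx_do, fidx_tar, etc.) are illustrative stand-ins, not the actual gkyl_nmat objects or the num_numcol_fidx_do/tar maps, and the gather/multiply/accumulate structure is the only thing it is meant to convey:

#include <stdlib.h>

// Illustrative serial version of steps 1-3 for one batch of donors.
// nbasis: DG coefficients per cell; ncol: number of y-vpar-mu columns;
// ndo: number of donor matrices. Matrices are stored column-major.
static void
ts_apply_reference(int nbasis, int ncol, int ndo,
  const double *scimat,  // ndo matrices, each nbasis x nbasis
  const long *fidx_do,   // (q,c) -> cell location in the donor distribution
  const long *fidx_tar,  // (q,c) -> cell location in the target distribution
  const double *fdo, double *ftar)
{
  double *fmat = malloc(sizeof(double)*nbasis*ncol);  // one donor's fmat
  double *prod = malloc(sizeof(double)*nbasis*ncol);  // scimat_q * fmat_q

  for (int q=0; q<ndo; ++q) {
    // Step 1: gather donor DG coefficients into the columns of fmat.
    for (int c=0; c<ncol; ++c)
      for (int n=0; n<nbasis; ++n)
        fmat[c*nbasis+n] = fdo[fidx_do[q*ncol+c]*nbasis+n];

    // Step 2: the mat-mat multiply (done in batch by gkyl_nmat_mm/cuBLAS).
    for (int c=0; c<ncol; ++c)
      for (int n=0; n<nbasis; ++n) {
        double sum = 0.0;
        for (int m=0; m<nbasis; ++m)
          sum += scimat[(q*nbasis+m)*nbasis+n]*fmat[c*nbasis+m];
        prod[c*nbasis+n] = sum;
      }

    // Step 3: accumulate the product into the target distribution.
    for (int c=0; c<ncol; ++c)
      for (int n=0; n<nbasis; ++n)
        ftar[fidx_tar[q*ncol+c]*nbasis+n] += prod[c*nbasis+n];
  }

  free(fmat);
  free(prod);
}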
Some things still left to improve:
- gkyl_nmat_mm is still creating a cuBLAS handle every time. We need to change it so it only creates it once.
- fmat is a num_do_tot x Ny x Nvpar x Nmu x ncomp chunk of memory, which is somewhere between a 4D and a 5D distribution. It might be advantageous to find a way to reduce this memory need.
- scimat has a bit of repetition now that all scenarios are treated with variable y limits. We should look into writing this even more modularly with less repetition. But we may also not want to do this until we've done some simulations and know the present approach is robust.

I support finding a way to reduce the memory, since @manauref mentioned in the GK meeting today that high resolution LAPD simulations are approaching a memory limit. This will undoubtedly be exacerbated in shaped geometry and if neutrals/impurities are included.
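As a back-of-the-envelope check of that footprint (the grid sizes, donor count, and coefficient count below are made-up, hypothetical numbers, only to show the scaling of the fmat storage against one full distribution):

#include <stdio.h>

int main(void)
{
  // Hypothetical values, for illustration only.
  long Nx = 48, Ny = 96, Nvpar = 16, Nmu = 8;  // cells
  long ncomp = 32;       // DG coefficients per cell (made-up)
  long avg_do = 3;       // assumed average donors per x index
  long num_do_tot = avg_do*Nx;

  double fmat_gb = 8.0*num_do_tot*Ny*Nvpar*Nmu*ncomp/1e9;  // doubles
  double dist_gb = 8.0*Nx*Ny*Nvpar*Nmu*ncomp/1e9;          // one 5D dist

  printf("fmat: %.3f GB, one distribution: %.3f GB\n", fmat_gb, dist_gb);
  return 0;
}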
Which 2 kernels that you reference in improvement no. 5 are no longer used? Do you mean that in the sub-cell integral calculation, where 3 kernels were needed before, only one is necessary now?
Reducing the memory footprint of TS BCs will not improve our ability to run bigger (clopen) sims on GPUs, because it's the difference between ~5 distributions and 6 (so only ~16.7% of the footprint). What we really need is parallelization in other dimensions, or a refactoring of many other parts of the code so we allocate fewer arrays.
In g2 there were 3 kernels:
Thanks for clarifying.
Closed as this work is complete. Nice job, @manauref!
[UPDATE]: see latest design in my comment on July 31st below.
These boundary conditions are meant for 3D gyrokinetics (though in fact they were also applied to fluids in other codes), and consist of copying the distribution at one z-boundary into the other with a y-shift S(x) (or viceversa), described, along with its DG algorithm, in M. Francisquez, et al. CPC 298 (2024) 109109 (see eqn 9).
As given in eqn 30 of that paper, the distribution in the target cell (i,j_tar) is given by a linear combination of mat-vec multiplies, including contributions from N_do(i) donor cells.
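Schematically (a paraphrase of that equation, not a verbatim copy), for each target cell the operation is something like

ftar_{i,j_tar} = sum_{q=1}^{N_do(i)} M_q fdo_q

where M_q is the num_basis x num_basis matrix of sub-cell integral contributions for the q-th donor and fdo_q is the vector of DG coefficients of that donor cell.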
So implementing this BC consists of two things:
1. Constructing the M_q matrices.
2. Performing the mat-vec multiplies and adding their contributions.

We intend to do this by building up the capabilities of zero/bc_twistshift.c, which was created last year just to do 2, assuming 1 was done in g2. We'll mention some initial design intentions about each of these below, and edit it as things evolve:
Constructing M_q matrices
We'll store a flat list of donor matrices, for which we'll need to know the total number of donors (matrices or cells, total_donor_mats). This is computed after having a (flat) list of the donor cell indices (cellsDo in g2, currently cells_do in g0). To be precise, cells_do is a flattened 4D array: for each (i,j) location we store the x-y indices of the N_do(i) donors (see the layout sketch below).

After knowing who the donors are for each target, we proceed to populate their elements. This involves a sequence of sub-cell integrals, which we have to craft based on a number of 1D projections (onto a DG basis), root finding, and calling 1 of 3 kernels. In g2 this part of the code made heavy use of Lua tables to store scalars, arrays, functions and maybe even strings; we'll have to figure out how to write the C equivalent without such tables, though in some places we may replace those data objects with C structs.
The matrices are computed on the host, and are later copied to the GPU with gkyl_nmat_copy.
Performing mat-vec multiplies and adding their contributions
Knowing who the donors are, and the map from target cell index to linear index into matsdo, we can easily perform these multiplications (i.e. w/ gkyl_nmat_mv) on the host, in serial, and add their contributions together.

On the GPU this is a little trickier. Part of the challenge is that we call gkyl_nmat_mv, which calls cuBLAS, and this is not meant to be called in a kernel. In the current approach the mat-vec multiplies are being done with gkyl_nmat_mv, and then the reduction and assignment into the solution vector call gkyl_bc_twistshift_inc_cu in each x cell (so this step is not parallelized over x).

To expose greater parallelism we could do this (a private comment from last year):