ammarhakim / gkylzero

Lowest, compiled layer of Gkeyll infrastructure.

[DR] Field Solve for Multi-block Simulations (Blocks Connected in Parallel Direction) #407

Closed by akashukla 2 days ago

akashukla commented 1 week ago

For multi-block gyrokinetic simulations we need special considerations for the Poisson solve. Our GK Poisson solve has two steps: a smoothing in the parallel direction and then a perpendicular solve on each of the perpendicular planes.
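
A minimal sketch of this two-step structure for a single block is below; the names here (gk_field_solve, smooth_parallel, solve_perp_plane, struct field_data) are hypothetical placeholders for illustration and not the actual gkylzero API.

  /* Hypothetical sketch of the two-step GK field solve (not the gkylzero API). */
  struct field_data {
    int nx, ny, nz;   /* number of cells in x, y, z */
    double *rho;      /* charge density */
    double *phi;      /* electrostatic potential */
  };

  /* Step 1: smooth the charge density along the parallel (z) direction. */
  static void smooth_parallel(struct field_data *f) { /* ... */ }

  /* Step 2: solve the perpendicular Poisson problem on the z-plane at index k. */
  static void solve_perp_plane(struct field_data *f, int k) { /* ... */ }

  void gk_field_solve(struct field_data *f)
  {
    smooth_parallel(f);           /* parallel smoothing first ... */
    for (int k = 0; k < f->nz; ++k)
      solve_perp_plane(f, k);     /* ... then a perpendicular solve on each plane */
  }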

For multi-block simulations, we will need to modify the solver to accommodate blocks connected in the parallel direction and also blocks connected in the perpendicular direction. As of now, this DR only considers blocks connected in the parallel direction.

In order to do the parallel smoothing, we need to collect the charge density from all of the blocks into a single array. Each block's charge density must be placed in the right location of this array based on the connectivity of the blocks.

In a multi-block simulation, multiple blocks may be owned by a single rank, or one block may be decomposed across multiple ranks.

Here is an overview of the steps that need to be taken in the advance method (a rough skeleton follows the list):

  1. Each rank calculates its own local charge density (each block may have several cuts/ranks or one rank may own several blocks)
  2. Gather the charge density within each block
  3. Copy the charge density from each block into the appropriate location of a charge density array that spans all of the blocks connected along z. I will call this the "crossz" charge density for convenience.
  4. Communicate the crossz charge density to all of the ranks participating. This is the most complicated step.
  5. Perform the parallel smoothing operation on the crossz charge density.
  6. Copy the appropriate portion of the smooth solution back into each block.
  7. Solve the perpendicular problem for the potential, phi, in each block
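
A rough skeleton of this advance method is sketched below. The struct and function names (mb_field, gather_block_rho, and so on) are hypothetical placeholders, not existing gkylzero interfaces; each stub stands in for one of the numbered steps above.

  /* Hypothetical skeleton of the multi-block field advance (names are placeholders). */
  struct mb_range { int lower[3], upper[3]; };   /* simplified index range */

  struct mb_field {
    int num_blocks, num_cuts;
    struct mb_range crossz_range;    /* spans all blocks connected along z */
    struct mb_range *cut_ranges;     /* per-cut sub-range of the crossz range */
    int *cut_ranks;                  /* rank that owns each cut */
    double *rho_crossz;              /* crossz charge density buffer */
  };

  /* Stubs standing in for the per-step operations. */
  static void compute_local_rho(struct mb_field *f) { }              /* step 1 */
  static void gather_block_rho(struct mb_field *f) { }               /* step 2 */
  static void copy_block_to_crossz(struct mb_field *f, int b) { }    /* step 3 */
  static void bcast_crossz_rho(struct mb_field *f) { }               /* step 4 */
  static void smooth_parallel_crossz(struct mb_field *f) { }         /* step 5 */
  static void copy_crossz_to_block(struct mb_field *f, int b) { }    /* step 6 */
  static void solve_perp_in_block(struct mb_field *f, int b) { }     /* step 7 */

  void mb_field_advance(struct mb_field *f)
  {
    compute_local_rho(f);                        /* 1. local charge density on each rank */
    gather_block_rho(f);                         /* 2. gather within each block */
    for (int b = 0; b < f->num_blocks; ++b)
      copy_block_to_crossz(f, b);                /* 3. place each block in the crossz array */
    bcast_crossz_rho(f);                         /* 4. communicate crossz rho to all participating ranks */
    smooth_parallel_crossz(f);                   /* 5. parallel smoothing of the crossz rho */
    for (int b = 0; b < f->num_blocks; ++b)
      copy_crossz_to_block(f, b);                /* 6. copy the smoothed portion back into each block */
    for (int b = 0; b < f->num_blocks; ++b)
      solve_perp_in_block(f, b);                 /* 7. perpendicular solve for phi in each block */
  }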

The steps in the rhs method require us to compute some things ahead of time at t=0 (a sketch of this setup follows the list):

  1. For each block, we must trace the block connectivity to see which other blocks it is connected to along z. From this we can construct a "crossz" range that spans all of the blocks connected along z. This is the range on which the crossz charge density introduced in step 3 above is allocated.
  2. For each cut, we need to know its range and how this fits into the crossz range. We call these the "cut_ranges." For each cut we also need to know and store the rank that owns that cut. These ranges and ranks are required so we can broadcast the data from the right ranks to the right locations in step 4 above.
  3. For each block, we need to know its range as a sub-range of the crossz range. This is required for the copies in steps 3 and 6 above.
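
As a concrete illustration of this setup, here is a minimal sketch that stacks the block z-extents in connectivity order to build the crossz range and the per-block sub-ranges. The types and names (struct crossz_info, build_crossz) are hypothetical and only meant to show the bookkeeping, not an actual gkylzero interface.

  /* 1D index range along z (inclusive bounds); hypothetical, for illustration. */
  struct zrange { int lower, upper; };

  struct crossz_info {
    struct zrange crossz;          /* spans all blocks connected along z (item 1) */
    struct zrange *block_ranges;   /* per-block sub-range of crossz (item 3) */
    struct zrange *cut_ranges;     /* per-cut sub-range of crossz (item 2) */
    int *cut_ranks;                /* rank that owns each cut (item 2) */
  };

  /* Build the crossz range and per-block sub-ranges from the z-extents of the
     blocks, listed in the order in which they are connected along z. */
  void build_crossz(struct crossz_info *inf, const int *block_nz, int num_blocks)
  {
    int off = 0;
    for (int b = 0; b < num_blocks; ++b) {
      inf->block_ranges[b].lower = off;
      inf->block_ranges[b].upper = off + block_nz[b] - 1;
      off += block_nz[b];
    }
    inf->crossz.lower = 0;
    inf->crossz.upper = off - 1;   /* whole stack of blocks along z */
    /* cut_ranges and cut_ranks would be filled similarly, by walking each
       block's decomposition along z and recording the rank that owns each cut. */
  }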

Initial prototyping work has been done in gk-g0-app_mb_app_poisson: https://github.com/ammarhakim/gkylzero/tree/gk-g0-app_mb_app_poisson

This implementation of the field solver does not affect input files. The input for the field will be the same as it has been for single block apps.

Further Details/example code for the broadcasting step (step 4 in the rhs):

  for (int ic=0; ic<mbf->num_cuts; ic++) { // num_cuts is the total number of cuts in the simulation
    range_curr = mbf->cut_ranges[ic];
    loop over x and y cells {
      start_idx = {xcell, ycell, lower zcell of range_curr}
      data_start = data at start_idx
      send_size = nz*ncomp // nz = number of z cells in range_curr, ncomp = number of DG coefficients
      send_rank = rank that owns cut number ic // known from the initial setup at t=0
      gkyl_comm_bcast(zcomm, data_start, send_size, send_rank);
    }
  }
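
For reference, a single one of these broadcasts can be expressed with raw MPI as in the sketch below. This is only a hedged illustration: it assumes row-major storage with z the fastest-varying index, so that the z-run at a fixed (x, y) cell of the cut is a contiguous block of nz*ncomp doubles, and the function name and arguments are placeholders rather than the actual gkyl_comm interface.

  #include <mpi.h>

  /* Broadcast the z-run of DG coefficients at one (x,y) cell of one cut from
     the rank that owns that cut to all ranks in the z communicator. Assumes
     the run is contiguous (row-major data with z the fastest index). */
  void bcast_z_run(MPI_Comm zcomm, double *crossz_data, long start_offset,
    int nz, int ncomp, int owner_rank)
  {
    /* crossz_data + start_offset points at the coefficients of the lower-z
       cell of this (x,y) column within the cut's sub-range of the crossz array. */
    MPI_Bcast(crossz_data + start_offset, nz*ncomp, MPI_DOUBLE, owner_rank, zcomm);
  }
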
manauref commented 1 week ago

This overall structure is good IMO, especially if the prototype is working. I only have one suggestion: I don't see why the allgather is strictly necessary. The MB app knows where every rank goes, and should be able to do this in a loop over bcast with the appropriate ranges.

Something that is not explicitly stated here, but may affect some of these decisions, is the fact that our data is in row-major order. For this reason I think we had a step in g2 where we re-arranged things after the communication to account for this, specifically in this method of GkField:

function GkField:AllgatherFieldZ(fldLocal, fldGlobal)
   -- Gather a CartField distributed along z onto a global-in-z field.
   fldLocal:copyRangeToBuffer(fldLocal:localRange(), self.localBuffer:data())

   -- Gather flat buffers from other processes.
   local mess = self.grid:getMessenger()
   local comm = mess:getComms()["z"]
   mess:Allgather(self.localBuffer, self.globalBufferZ, comm)

   -- Rearrange into a global field, i.e. reconstruct the multi-D array
   -- using the collection of flat buffers we gathered from each process.
   for zIdx in self.zCutsRange:colMajorIter() do
      fldGlobal:copyRangeFromBuffer(self.zDestRange[zIdx[1]],
         self.globalBufferZ:data()+self.zBufferOffset[zIdx[1]])
   end
end

I believe the same thing is possible, and should perhaps be done here; it's just that the ranges now come from other blocks and not just from other portions of a decomposition.

We may however pursue this as a later optimization depending on current priorities.

ammarhakim commented 1 week ago

I would like to see this DR presented so we can make suggestions on improving it.

Most importantly, as we discussed yesterday, I want to separately see a DR on the modifications to the comm infrastructure. I do not want to see arbitrary changes to the present comm without discussion. We need a proper, hierarchical system of comms (perhaps; not sure until one has thought about this some more).

ammarhakim commented 2 days ago

I am closing this DR as I do not think this is the proper design. This needs a proper re-design that accounts for hierarchical communication and respects the layering of one App per block. This design mixes ranks, communicators, and blocks, which confuses the question of how communication should be done.

For example, you should compute the charge density in each block. Then the parent, which knows the topology, should assemble the large global rhoc array. This should be smoothed. At that point the smoothed rhoc is already in a single global vector. Then the Poisson equation should be solved. This results in a large global phi array that, at this point, only the parent has access to. The parent, knowing the topology, then sends each block its portion of phi.
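
A rough sketch of that flow, purely for illustration, is below. The struct and function names (parent_field, assemble_global_rhoc, and so on) are hypothetical placeholders standing in for the operations described above, not an existing gkylzero interface.

  /* Hypothetical skeleton of the parent-driven field solve (names are placeholders). */
  struct parent_field {
    int num_blocks;
    double *rhoc_global;   /* global charge density assembled by the parent */
    double *phi_global;    /* global potential, initially known only to the parent */
  };

  static void assemble_global_rhoc(struct parent_field *p) { }     /* parent gathers each block's rhoc */
  static void smooth_global_rhoc(struct parent_field *p) { }       /* parallel smoothing on the global vector */
  static void solve_poisson_global(struct parent_field *p) { }     /* Poisson solve for the global phi */
  static void send_phi_to_block(struct parent_field *p, int b) { } /* parent sends block b its portion of phi */

  void parent_field_solve(struct parent_field *p)
  {
    assemble_global_rhoc(p);     /* blocks compute rhoc; the parent, knowing the topology, assembles it */
    smooth_global_rhoc(p);       /* smoothed rhoc is already a single global vector */
    solve_poisson_global(p);     /* resulting phi lives only on the parent */
    for (int b = 0; b < p->num_blocks; ++b)
      send_phi_to_block(p, b);   /* parent distributes to each block its portion of phi */
  }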