CharacteristicMappingMethod / cmm-turbulence

CMM Turbulence code
GNU General Public License v3.0

Hermite interpolation strided access #25

Open Arcadia197 opened 2 years ago

Arcadia197 commented 2 years ago

For now, the strided access in the Hermite interpolation is one of the main performance bottlenecks of the code. Some ideas need to be tested to improve this.

Possible ways could be: Have another look at the organization of the files and reorganize psi so that all data of one point are next to each other in memory. This was already tested once and led to no improvement. However, the reads were not made uniform in that test, so it may be worth testing again with the goal of needing only 2 consecutive reads.

However, the warping structure of the diffeomorphism makes it hard to predict where to read the data at the boundaries. Separating boundary and non-boundary cases would in turn lead to divergent code.
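A minimal sketch of the two layouts discussed above, to make the "2 consecutive reads" idea concrete. It assumes four Hermite values per grid point (f, f_x, f_y, f_xy), a row-major periodic grid, and hypothetical function names; the actual layout of psi in the code may differ. It counts how many contiguous memory runs the 2x2 stencil of one cell touches in each layout.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: four Hermite values (f, f_x, f_y, f_xy) are stored per grid point.
constexpr int kValuesPerPoint = 4;

// Periodic wrap without a branch; valid for i >= -n.
inline int wrap(int i, int n) {
    assert(n > 0 && i >= -n);
    return (i + n) % n;
}

// Strided layout: one array of nx*ny entries per Hermite value, stored back to back.
inline std::size_t idx_strided(int ix, int iy, int k, int nx, int ny) {
    std::size_t point = static_cast<std::size_t>(wrap(iy, ny)) * nx + wrap(ix, nx);
    return static_cast<std::size_t>(k) * nx * ny + point;
}

// Interleaved layout: the four values of one point are adjacent in memory.
inline std::size_t idx_interleaved(int ix, int iy, int k, int nx, int ny) {
    std::size_t point = static_cast<std::size_t>(wrap(iy, ny)) * nx + wrap(ix, nx);
    return point * kValuesPerPoint + k;
}

// Count the contiguous index runs touched by the 2x2 stencil of cell (ix, iy).
template <typename IndexFn>
int count_runs(IndexFn index, int ix, int iy, int nx, int ny) {
    std::array<std::size_t, 4 * kValuesPerPoint> a{};
    int n = 0;
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx)
            for (int k = 0; k < kValuesPerPoint; ++k)
                a[n++] = index(ix + dx, iy + dy, k, nx, ny);
    std::sort(a.begin(), a.end());
    int runs = 1;
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i] != a[i - 1] + 1) ++runs;  // a gap starts a new run
    return runs;
}
```

For an interior cell of an 8x8 grid, `count_runs(idx_interleaved, 3, 3, 8, 8)` gives 2 runs, while the strided layout gives 8. At the periodic boundary (`ix = 7`) the interleaved stencil splits into 3 runs, which illustrates why the boundary reads are harder to predict.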

Arcadia197 commented 2 years ago

With new tests I was able to directly link the Hermite interpolation to the increase in run time. When only the map update epsilon is increased, the footpoints are computed further away and may start to use other Hermite points for interpolation. New computations confirm this: I saw a quite severe time increase. This links the speed of the Hermite interpolation to the map update epsilon and the grid size of the velocity.
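A rough sketch of this effect, under the assumption that the map update evaluates four points offset diagonally by ±epsilon around each grid point, and ignoring the additional displacement from the advection itself. The function name and parameters are hypothetical. It counts how many distinct cells of a grid with spacing h those four points fall into; more distinct cells means more distinct sets of Hermite points have to be read.

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <utility>

// Number of distinct grid cells (spacing h) hit by the four points
// (x +/- eps, y +/- eps). Assumed stencil shape, not taken from the code.
int cells_touched(double x, double y, double eps, double h) {
    assert(h > 0.0);
    std::set<std::pair<long, long>> cells;
    for (int sy = -1; sy <= 1; sy += 2) {
        for (int sx = -1; sx <= 1; sx += 2) {
            long cx = static_cast<long>(std::floor((x + sx * eps) / h));
            long cy = static_cast<long>(std::floor((y + sy * eps) / h));
            cells.insert({cx, cy});
        }
    }
    return static_cast<int>(cells.size());
}
```

With h = 0.25 and a point at the cell center (0.375, 0.375), an epsilon of 0.0625 keeps all four points in one cell, while an epsilon of 0.1875 spreads them over four different cells.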

Arcadia197 commented 2 years ago

I have thought about this for a while now. If the CFL number is smaller than 1, we know that for each time step we only need ((16+2) * N_p / N_c)^2 points for the advection of one block, as the new solution does not travel further. This could be sped up considerably by loading all values of a block beforehand and reusing them, which should result in a MASSIVE speed boost.
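To get a feel for the numbers, here is a small sketch that evaluates the estimate above and the resulting memory footprint of one tile. The helper names are hypothetical, and the 64 bytes per point (2 map components x 4 Hermite values x 8-byte doubles) and the 48 KiB shared memory budget per block are assumptions that should be checked against the actual data layout and target GPU.

```cpp
#include <cassert>
#include <cstddef>

// Points along one tile edge: (threads + halo) * N_p / N_c.
// Integer division; assumes the product is divisible by n_c.
inline std::size_t tile_edge(std::size_t threads, std::size_t halo,
                             std::size_t n_p, std::size_t n_c) {
    assert(n_c > 0);
    return (threads + halo) * n_p / n_c;
}

// Total points of the square tile needed to advect one block.
inline std::size_t tile_points(std::size_t threads, std::size_t halo,
                               std::size_t n_p, std::size_t n_c) {
    std::size_t e = tile_edge(threads, halo, n_p, n_c);
    return e * e;
}

// Bytes needed to hold the tile, for an assumed per-point storage size.
inline std::size_t tile_bytes(std::size_t threads, std::size_t halo,
                              std::size_t n_p, std::size_t n_c,
                              std::size_t bytes_per_point) {
    return tile_points(threads, halo, n_p, n_c) * bytes_per_point;
}

// True if the tile fits into the given shared memory budget (in bytes).
inline bool fits_shared(std::size_t bytes, std::size_t budget) {
    return bytes <= budget;
}
```

With N_p / N_c = 1 the tile has 18^2 = 324 points, or 20736 bytes at 64 bytes per point. With N_p / N_c = 2 it grows to 36^2 = 1296 points and 82944 bytes, which no longer fits into a 48 KiB (49152 byte) budget.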

However:

  1. N_p seems to be a critical parameter (still under investigation). Already with a factor of 2 we run into problems with shared memory, as we may load too many values. A fix could be to make the blocks smaller (by repeating the computations with fewer threads per block). However, this would have to be implemented quite flexibly and sounds hard.
  2. The CFL condition might be satisfied for large computations, but so far I have never achieved it in my test computations, and workarounds and if-clauses to decide on this sound annoying. Nevertheless, we can always get the maximum velocity as the maximum norm of the initial velocity and therefore compute the CFL number beforehand (nice).
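The precomputation mentioned in point 2 could look roughly like the sketch below. It follows the suggestion of taking the maximum norm of the initial velocity as the velocity bound and uses the usual definition CFL = u_max * dt / h. The function names are hypothetical, and which grid spacing h is the relevant one is left open here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Maximum Euclidean norm over a 2D velocity field given as two component arrays.
double max_velocity_norm(const std::vector<double>& u, const std::vector<double>& v) {
    assert(u.size() == v.size());
    double u_max = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i)
        u_max = std::max(u_max, std::hypot(u[i], v[i]));
    return u_max;
}

// CFL number for time step dt and grid spacing h.
double cfl_number(double u_max, double dt, double h) {
    assert(h > 0.0);
    return u_max * dt / h;
}

// Decide once, before the time loop, whether the tiled fast path is admissible.
bool can_use_tiled_path(double u_max, double dt, double h) {
    return cfl_number(u_max, dt, h) < 1.0;
}
```

Because the check is done once on the host before the time loop, it would select a kernel variant up front instead of adding if-clauses inside the kernel.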

My thoughts on this so far: