Hermite interpolation strided access

Arcadia197 commented 2 years ago

For now, the strided access in the Hermite interpolation is one of the main bottlenecks for code performance. Some ideas need to be tested to improve this.

One possibility: take another look at the data layout and reorganize psi so that all data of one point are stored next to each other. This was already tested once and led to no improvement; however, the reads were not made uniform at the time, so it may be worth testing again with the goal of needing only 2 consecutive reads.

However, the warping structure of the diffeomorphism makes it hard to predict where to read the data at the boundaries. Separating boundary and non-boundary cases, in turn, leads to divergent code.
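
To make the layout idea concrete, here is a minimal Python sketch (not the actual CUDA code) that counts how many contiguous memory runs one interpolation cell touches under two layouts. Everything in it is an assumption for illustration: the function names, the 4 Hermite values per point (f, f_x, f_y, f_xy), and the periodic wrap via modulo as a branch-free way to handle the boundary.

```python
# Hypothetical layouts for Hermite data on an nx-by-ny grid.
# Assumption: each grid point carries 4 values (f, f_x, f_y, f_xy).
N_COMP = 4

def idx_planar(comp, ix, iy, nx, ny):
    """Component-major layout: one full grid per Hermite component."""
    return comp * nx * ny + iy * nx + ix

def idx_interleaved(comp, ix, iy, nx, ny):
    """Point-major layout: the 4 components of one point are adjacent."""
    return (iy * nx + ix) * N_COMP + comp

def cell_reads(index_fn, ix, iy, nx, ny):
    """Sorted flat indices of the 16 values needed for the cell at (ix, iy).

    Assumption: the grid is periodic, so neighbours wrap with a modulo
    instead of a separate boundary branch.
    """
    out = []
    for dy in (0, 1):
        for dx in (0, 1):
            jx, jy = (ix + dx) % nx, (iy + dy) % ny
            out.extend(index_fn(c, jx, jy, nx, ny) for c in range(N_COMP))
    return sorted(out)

def contiguous_runs(indices):
    """Number of gap-free runs in a sorted list of flat indices."""
    runs = 1
    for a, b in zip(indices, indices[1:]):
        if b != a + 1:
            runs += 1
    return runs

if __name__ == "__main__":
    nx = ny = 8
    for name, fn in (("planar", idx_planar), ("interleaved", idx_interleaved)):
        interior = contiguous_runs(cell_reads(fn, 2, 3, nx, ny))
        boundary = contiguous_runs(cell_reads(fn, 7, 3, nx, ny))
        print(name, "interior runs:", interior, "boundary runs:", boundary)
```

For an interior cell the interleaved layout needs 2 contiguous runs where the planar one needs 8, which matches the "2 consecutive reads" target. At a wrapped boundary cell the runs split differently, which is the hard-to-predict part mentioned above.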

Arcadia197 commented 2 years ago

With new tests I was able to link the Hermite interpolation directly to the increase in run time. When only the map update epsilon is increased, the footpoints are computed further away and may start to use other Hermite points for interpolation. The runs confirm this: I saw a quite severe increase in run time. This links the speed of the Hermite interpolation to the map update epsilon and to the grid size of the velocity.

Arcadia197 commented 2 years ago

I have thought about this for a while now. If the CFL number is smaller than 1, we know that for each time step we only need ((16+2) * N_p / N_c)^2 points for the advection of one block, as the new solution does not travel further. This could be sped up considerably by loading all values of a block beforehand and reusing them, which should give a massive speed boost.
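
A rough sketch of the numbers behind this estimate, in Python for quick checking. The reading of 16 as the block side and 2 as the halo is an interpretation of the formula, and the 4 Hermite values per point, the 2 map components, double precision, and all function names are assumptions. The 48 KiB figure is the default per-block shared memory limit in CUDA without opt-in.

```python
def points_per_block(ratio, block_side=16, halo=2):
    """Points needed by one block: ((block_side + halo) * ratio)^2.

    `ratio` stands for N_p / N_c from the formula above.
    """
    side = (block_side + halo) * ratio
    return side * side

def shared_bytes(ratio, values_per_point=4, map_components=2, bytes_per_value=8):
    """Shared memory needed to hold the block footprint (assumed sizes)."""
    return points_per_block(ratio) * values_per_point * map_components * bytes_per_value

# Default CUDA limit per block without opting in to larger carve-outs.
SHARED_LIMIT = 48 * 1024

if __name__ == "__main__":
    for ratio in (1, 2):
        needed = shared_bytes(ratio)
        print("ratio", ratio,
              "points", points_per_block(ratio),
              "bytes", needed,
              "fits" if needed <= SHARED_LIMIT else "too large")
```

Under these assumptions a ratio of 1 needs 20736 bytes and fits, while a ratio of 2 needs 82944 bytes and exceeds the default limit.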

However:

1. N_p seems to be a critical parameter (still under investigation), and already with a factor of 2 we run into problems with shared memory because we may load too many values. A fix could be to make the blocks smaller (by repeating computations with fewer threads per block). However, this has to be implemented quite flexibly and sounds hard.
2. The CFL condition might be met for large computations, but so far I have never satisfied it in my test computations, and workarounds and if-clauses to decide this sound annoying. Nevertheless, we can always get the maximum velocity as the maximum norm of the initial velocity and therefore compute the CFL number beforehand (nice).
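
The precomputation in the second point could look roughly like the following Python sketch. It assumes the velocity is available as two lists of samples (`u`, `v`) on a grid with spacing `h`, and it takes the bound by the initial maximum velocity as given; the function names are hypothetical.

```python
import math

def cfl_number(u, v, dt, h):
    """CFL = max |velocity| * dt / h over all initial velocity samples."""
    u_max = max(math.hypot(a, b) for a, b in zip(u, v))
    return u_max * dt / h

def block_preload_allowed(u, v, dt, h):
    """True if CFL < 1, i.e. the preloaded block footprint would suffice."""
    return cfl_number(u, v, dt, h) < 1.0

if __name__ == "__main__":
    u0 = [1.0, 0.0, -3.0]
    v0 = [0.0, 2.0, 4.0]
    print("CFL:", cfl_number(u0, v0, dt=0.0625, h=0.5))
    print("preload ok:", block_preload_allowed(u0, v0, dt=0.0625, h=0.5))
```

Doing this check once on the host before the time loop would avoid per-step if-clauses inside the kernel, at the cost of falling back to the strided path whenever the check fails.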

Thoughts I have had on this so far:

- As stated under 2, we always know the CFL number beforehand and can therefore compute an estimate of the points needed for the advection.
- We could also introduce dynamic loading, where only part of the points are loaded dynamically while the others stay in shared memory; however, this could lead to unbalanced memory handling.
- In the end I have to work this out thoroughly once to get the maximum factor for N_psi, and I have to wait for the results of the importance study anyway.
- Since Psi is in Hermite form and we sometimes need Psi three times, this puts a big strain on loading, especially if we do Lagrange interpolation. Maybe I can come up with a clever idea for this.

CharacteristicMappingMethod / cmm-turbulence

Hermite interpolation strided access #25