u=0 to NaN in a single time step for off-center y domains --sometimes

johnfgibson commented 5 years ago

I've been running ASBL simulations with y domain 0 < y < 4 and occasionally having the u flowfield blow up unexpectedly right away, and not due to CFL problem. I've reduced it down to the following: one time step on a zero velocity field with plane Couette conditions.

Actual result and steps to reproduce (bug for off-center y domain)


gibson@sophist$ randomfield -m 0 -Nx 128 -Ny 129 -Nz 162 -Lx 4 -Lz 2 -ymin 0 -ymax 4 uzero-128x129x162-yoffcenter
gibson@sophist$ mpirun -n 5 simulateflow -np0 1 -np1 5 -R 1000 -dt 0.005 -vdt false -Uwall 1 -T 0.005 -T0 0 -dT 0.005 uzero-128x129x162-yoffcenter.nc 
Constructing u,q, and optimizing FFTW...
Uwall == 1
Wwall == 0
dnsflags == nu==0.001, Vsuck==0, rotation==0, theta==0, dPdx==0, dPdz==0, Ubulk==0, Wbulk==0, uwall==1, uupper==1, ulower==-1, wupper==0, wlower==-0, t0==0, dT==0.005, dt==0.005, variabledt==0, dtmin==0.001, dtmax==0.2, CFLmin==0.4, CFLmax==0.6, LaminarBase, PressureGradient, SBDF3, SMRK2, Rotational, DealiasXZ, zero_bodyforce, TauCorrection, Silent
constructing DNS...
           t == 0
chebyNorm(u) == 0
        1/nu == 1000
  Uwall h/nu == 2000
  Ubulk h/nu == 0
  Umean h/nu == 0 * 2 / 0.001
  Umean h/nu == 0
 Uparab h/nu == 0
Ucenter h/nu == 0
         CFL == 0.335103
           t == 0.005
chebyNorm(u) == nan
        1/nu == 1000
  Uwall h/nu == 0
  Ubulk h/nu == 0
  Umean h/nu == -nan * 2 / 0.001
  Umean h/nu == -nan
 Uparab h/nu == 0
Ucenter h/nu == -nan
         CFL == -nan
L2Norm(u) is nan
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54961,1],4]
  Exit code:    1
--------------------------------------------------------------------------

The chebyshev code for domains other than y in [-1,1] is not heavily tested, and I wouldn't be surprised if this was just flushing out an error in that. But surprisingly the problem is also dependent on the z discretization. Here I'll run the same simulation, y still off-center, but changing to Nz=128 rather than 162.

Expected Result

gibson@sophist$ randomfield -m 0 -Nx 128 -Ny 129 -Nz 128 -Lx 4 -Lz 2 -ymin 0 -ymax 4 uzero-128x129x128-yoffcenter
gibson@sophist$ mpirun -n 5 simulateflow -np0 1 -np1 5 -R 1000 -dt 0.005 -vdt false -Uwall 1 -T 0.005 -T0 0 -dT 0.005 uzero-128x129x128-yoffcenter.nc 
Constructing u,q, and optimizing FFTW...
Uwall == 1
Wwall == 0
dnsflags == nu==0.001, Vsuck==0, rotation==0, theta==0, dPdx==0, dPdz==0, Ubulk==0, Wbulk==0, uwall==1, uupper==1, ulower==-1, wupper==0, wlower==-0, t0==0, dT==0.005, dt==0.005, variabledt==0, dtmin==0.001, dtmax==0.2, CFLmin==0.4, CFLmax==0.6, LaminarBase, PressureGradient, SBDF3, SMRK2, Rotational, DealiasXZ, zero_bodyforce, TauCorrection, Silent
constructing DNS...
           t == 0
chebyNorm(u) == 0
        1/nu == 1000
  Uwall h/nu == 2000
  Ubulk h/nu == 0
  Umean h/nu == 0 * 2 / 0.001
  Umean h/nu == 0
 Uparab h/nu == 0
Ucenter h/nu == 0
         CFL == 0.335103
           t == 0.005
chebyNorm(u) == 4.84535e-16
        1/nu == 1000
  Uwall h/nu == 2000
  Ubulk h/nu == 0
  Umean h/nu == 0 * 2 / 0.001
  Umean h/nu == 0
 Uparab h/nu == 0
Ucenter h/nu == 0
         CFL == 0.335103
done!

Centering the Ly=4 domain about zero, with y in [-2,2] also works fine.

gibson@sophist$ randomfield -m 0 -Nx 128 -Ny 129 -Nz 162 -Lx 4 -Lz 2 -ymin -2 -ymax 2 uzero-128x129x162-ycentered                            
gibson@sophist$ mpirun -n 5 simulateflow -np0 1 -np1 5 -R 1000 -dt 0.005 -vdt false -Uwall 1 -T 0.005 -T0 0 -dT 0.005 uzero-128x129x162-ycentered.nc

...runs fine...

Information on your system

openSUSE Leap 15.0, x86-64, 6-core Intel Core i7-3690x

gibson@sophist$ cmake -DCMAKE_CXX_COMPILER=/usr/lib64/mpi/gcc/openmpi/bin/mpic++  -DCMAKE_INSTALL_PREFIX=~/channelflow-master -DWITH_SHARED=ON -DWITH_HDF5=ON ~/gitworking/channelflow

Additional context

I haven't run this in debug mode yet. Smaller ASBL simulations on y in [0,4] with Lx,Lz = 4,2 and 64 x 129 x 64 discretization integrate nicely for long times and have quite physically reasonable behavior.

sajjadazimi commented 5 years ago

@johnfgibson I could follow this problem down in the code to a single line but I need your math knowledge to solve it. These are my observations:

1 - NAN only happens when the initial stepping is smrk2 or cnrk2. So, for cnfe1 and sbdf1 which are of the type multistep DNSalgorithm, this error does not happen. 2 - The difference between these methods is that multistep.advance does not form the linear term of Navier Stokes. 3 - In the linear term, all the modes of the initial pressure after mx = 40 and mz = 53 are NAN. 5 - The initial pressure is created in the simulateflow program itself by a pressure solver class. 6 - In pressure solver class in the solve method at line 409, a variable called delta is calculated. For the modes after mx = 40 and mz = 53, delta becomes so big that the code returns inf. 7 - At line 411 and 412 of the same file c and d are calculated by the ratio of (alpha exp(gamma b)/delta where b is the height of the upper wall and alpha is zero. When b is 2, exp(gamma * b) is a big number but alpha (big number) is zero, and zero/inf is zero. So, c and d become zero and there is no problem. BUT when b_ = 4, the big number becomes so big that the code returns inf, and c = 0inf/inf = NAN. These c and d are used to calculate the initial pressure. 8 - When you set Nz = 128, this mode becomes an aliased mode and is not considered for linear term calculation, so again NAN does not happen.

I can solve this problem numerically, but my concern is actually the math behind as the expressions at lines 411 and 412 of Poisson solver are not independent of the location of lower and upper wall. For example, when you set a=0 and b =4 the terms should be the same as when you set a=100 and b=104.

johnfgibson commented 5 years ago

Nicely done. Thank you. I've just read your analysis and haven't looked at the code or the math yet, but I agree that the explicit dependence on the value of b (the upper wall position) rather than the difference b-a seems pretty suspect. I'll look into it further a.s.a.p.

johnfgibson commented 5 years ago

Ok, looking at the code now the problem seems pretty small and confined, and best of all there's a reference to a specific page in my handwritten math notebooks that chronicle the development of the code.

I haven't referred to those yet, but looking at the code and remembering what it's about, I suspect that the code is mathematically correct but subject to overflow of exp(gamma * b) and similar for large values of gamma and b, and should be refomulated mathematically to an equivalent expression that is more robust to calculation in floating-point.

johnfgibson commented 5 years ago

I have a correction for this and will post a pull request tomorrow. It was indeed a mathematically correct formulation that was subject to floating-point overflow of exp(mu*beta) for large values of mu and beta (changing notation from gamma to mu to avoid conflict with wavenumber gamma = 2pi/Lz). I reformulated the solutions of a homogeneous ODE g''(y) + mu^2 g(y) = 0 from exp(mu y) and exp(-mu y) to exp(mu (y-b)) and exp(-mu (y-a)), which makes them both bounded in [0,1] over the interval [a,b]. This avoids the overflow in the computation of the coefficients c and d in the solution g(y) = c exp(-mu (y-a)) + d exp(mu (y-b)) for the boundary value problem.

sajjadazimi commented 5 years ago

You're right. If you plug the expressions of c and d in g and combine the exp terms, it becomes independent of the absolute values of a and b. And it solves the floating-point overflow. Thanks

johnfgibson commented 5 years ago

Fixed by PR https://github.com/epfl-ecps/channelflow/pull/5

epfl-ecps / channelflow

u=0 to NaN in a single time step for off-center y domains --sometimes #4