OPM / opm-simulators

Simulator programs and utilities for automatic differentiation.
http://www.opm-project.org
GNU General Public License v3.0

Convergence failure - flow_ebos - model2 #1066

Closed nairr closed 7 years ago

nairr commented 7 years ago

For one of the realizations of model2 (realization-9), the solver fails to converge at time = 1827 days due to NaN residual for the water phase, when using the current build of flow_ebos. A version of flow_ebos compiled on 29/01/'17 could run the model successfully.

[screenshot: conv_fail]

GitPaean commented 7 years ago

The NaN residual for the water phase is also stopping the run of Model 2.2 with flow_ebos. It is not clear yet how that is triggered.

andlaus commented 7 years ago

Hm, can you try to compile everything using -Og and without -DNDEBUG? The runtime should roughly double, but you will get the assertions and can reasonably inspect the run using gdb.

(I'd do this myself, but so far I don't have that deck.)

andlaus commented 7 years ago

I've just noticed that even before the breakdown your timesteps are very small (10^-8 days); this probably points to the problem.

GitPaean commented 7 years ago

Yes. Maybe @nairr can post the residual history a little bit before the crash.

nairr commented 7 years ago

I've created a gist of the simulation step prior to failure here: https://gist.github.com/nairr/eeb7952c0029e05d4c14f4f1129ba7e7

@andlaus - I will try building with -Og and get back to you

andlaus commented 7 years ago

Strange: I just successfully completed a full run of (my version of) Model 2.2. Can you somehow give me access to your versions? (Send me a private mail.)

GitPaean commented 7 years ago

That is really great news. If you can share the summary file, the .DEBUG file, and the PRT file with me, that would be great. (kai.bao (AT) sintef.no)

I will find out if I can share with you the Model 2.2 file tomorrow.

GitPaean commented 7 years ago

From my side, although on a different deck, the NaN residual for the water phase comes from NaN values of R_sum. I am not sure how this happens yet.

GitPaean commented 7 years ago

In my run, the NaN residual for the water phase comes from a far too large BHP value.

andlaus commented 7 years ago

Do you have any idea whether this is caused by the well model or by the reservoir model? In both cases it is possible that the PVT extrapolation in opm-material goes belly-up...

GitPaean commented 7 years ago

The problem I am focusing on now is THP-related. Obviously, the THP control part is not completed yet, and we are investigating how we got such an obviously wrong BHP value (5000 or 6000 barsa or so) through the VFPProd table (the estimate should be a few hundred bar, judging from the table). Hopefully we will find something tomorrow.

andlaus commented 7 years ago

ok, cool. if I can help, let me know.

totto82 commented 7 years ago

I get the same when I run realization 0 with the current master. Using tolerance_wells=1e-3 or 1e-5 helped. The reason for the NaNs is the small timesteps caused by convergence problems.

atgeirr commented 7 years ago

GitHub's auto-closing was incorrect in this case. Reopening.

blattms commented 7 years ago

Small timesteps means we switch to float, right? Might this have caused problems?

atgeirr commented 7 years ago

Small timesteps means we switch to float, right?

"Small" used to mean < 20 days for legacy flow. For flow_ebos I am not sure where the boundary is, @andlaus or @dr-robertk?

andlaus commented 7 years ago

I don't really know, because to me the flow_ebos linear solver is a black box. Given that single-precision floating point values exhibit a precision of about 7 decimal digits, it would make sense. (The NaNs are likely not the root of the problem, though.)

GitPaean commented 7 years ago

Yes, I agree, NaNs are typically not the root of the problem. I am investigating a situation where it looks like some NaNs are generated during the linear solve.

 outputting the wellSolutions before updateWellState 
2.43964e+07
2.91488e+07
3.62471e+07
0.822454
1
1
0.09896
0
0
dx2_limited 0.142907
dx3_limited -0.127961
 dx1_limited 216357
dx2_limited nan
dx3_limited nan
 dx1_limited nan

dx2_limited nan
dx3_limited nan
 dx1_limited nan

GitPaean commented 7 years ago

The reason for the NaN looks to be that resWell_ contains NaN, so the root cause lies deeper.

dr-robertk commented 7 years ago

Guys, flow_ebos uses the exact same linear solver as flow_legacy. So no more black box here. Also, single precision is disabled, because otherwise the matrix would have to be copied.

atgeirr commented 7 years ago

single precision is disabled

So that cannot be the reason for failures. Good!

andlaus commented 7 years ago

So no black box here.

just wanted to say that I treat it as a black box because I do not understand that code, i.e. my statements about it should be taken with a pinch of salt ;)

GitPaean commented 7 years ago

It looks like combining PR #1083 and PR #1091 will fix the problem; the run has not finished yet, but it has already gotten further than before. PR #1091 alone does not fix it. I did not test with PR #1083 only.

andlaus commented 7 years ago

please close the issue if it is fixed. (I somehow failed in completely reproducing it...)

GitPaean commented 7 years ago

Did you also try that realization (realization-9)? I reproduced the problem at the same time step, although with a slightly different symptom (no NaN involved), probably due to some recent change.

Let us wait a little bit until PR #1083 and PR #1091 get merged; then we can consider it settled.

andlaus commented 7 years ago

Did you also try that realization (realization-9)?

No, I hadn't realized that it is that realization (sic). Great that you fixed it, though.

GitPaean commented 7 years ago

Hi, @nairr , from my side, it looks like PR #1083 fixed this problem. Could you please help to verify it?

nairr commented 7 years ago

Hi, I still experience the same convergence issue with a NaN residual for the water phase with PR #1083.

GitPaean commented 7 years ago

That is weird. Did you update all the modules to the latest version of master?

GitPaean commented 7 years ago

Basically, I reproduced your problem in a slightly different way with the latest master branch, applied PR #1083, and the run went through.

nairr commented 7 years ago

I did update all the modules to the latest master branch. However, I did not apply PR #1091.

GitPaean commented 7 years ago

Okay, I will try again. In my experience so far, it is PR #1083 that affects this issue.

GitPaean commented 7 years ago

Hi, @nairr, I tested again; with PR #1083 on top of the master branches, the issue is fixed.

Time step  115 at day 1827/5997, date = 01-Jan-2005
  Substep 0, stepsize 31 days.
Error: [/home/kaib/OPM-master-test/debug/opm-simulators/opm/autodiff/NonlinearSolver_impl.hpp:154] Failed to complete a time step within 15 iterations.
Problem: Solver convergence failed, restarting solver with new time step (10.230000 days).

  Substep 0, stepsize 10.23 days.
    Substep summary: well iterations = 4, newton iterations = 7, linearizations = 8 (8.4966 sec), linear iterations = 187 (11.2829 sec)
  Substep 1, stepsize 20.77 days.
Error: [/home/kaib/OPM-master-test/debug/opm-simulators/opm/autodiff/NonlinearSolver_impl.hpp:154] Failed to complete a time step within 15 iterations.
Problem: Solver convergence failed, restarting solver with new time step (6.854100 days).

  Substep 1, stepsize 6.8541 days.
    Substep summary: well iterations = 5, newton iterations = 12, linearizations = 14 (14.9858 sec), linear iterations = 314 (18.9499 sec)
  Substep 2, stepsize 13.9159 days.
Error: [/home/kaib/OPM-master-test/debug/opm-simulators/opm/autodiff/NonlinearSolver_impl.hpp:154] Failed to complete a time step within 15 iterations.
Problem: Solver convergence failed, restarting solver with new time step (4.592247 days).

  Substep 2, stepsize 4.59225 days.
    Substep summary: well iterations = 8, newton iterations = 16, linearizations = 19 (20.2524 sec), linear iterations = 411 (24.553 sec)
  Substep 3, stepsize 9.32365 days.
    Substep summary: well iterations = 9, newton iterations = 28, linearizations = 32 (33.1254 sec), linear iterations = 591 (35.4222 sec)

Time step  116 at day 1858/5997, date = 01-Feb-2005
  Substep 0, stepsize 20 days.
    Substep summary: well iterations = 3, newton iterations = 8, linearizations = 9 (8.92022 sec), linear iterations = 224 (13.2843 sec)

Time step  117 at day 1878/5997, date = 21-Feb-2005
  Substep 0, stepsize 6 days.
    Substep summary: well iterations = 2, newton iterations = 5, linearizations = 6 (6.01052 sec), linear iterations = 94 (5.84578 sec)

nairr commented 7 years ago

My bad, the issue is indeed fixed.

GitPaean commented 7 years ago

Looks like the problem is back again; I am not sure which change caused it. Somehow the CNV for the oil phase gets stuck at the value 1.594e-01.

  Substep 12, stepsize 4.46778e-05 days.
Iter  W-FLUX(water)  W-FLUX(oil)  W-FLUX(gas)
   0  1.256e-05  1.736e-06  6.123e-07
Iter    MB(W)      MB(O)      MB(G)      CNV(W)     CNV(O)     CNV(G)   W-FLUX(W)  W-FLUX(O)  W-FLUX(G)
   0  4.889e-10  8.672e-10  1.215e-10  1.268e-06  3.941e-07  1.590e-06  1.256e-05  1.736e-06  6.123e-07
   1  1.686e-13  1.640e-06  1.204e-06  6.008e-10  1.594e-01  1.170e-01  6.508e-09  2.125e-08  3.804e-11
   2  2.976e-07  1.364e-06  4.094e-07  6.393e-02  1.594e-01  3.530e-02  2.936e-13  9.558e-13  6.675e-06
   3  8.955e-10  1.517e-06  1.330e-07  8.336e-04  1.594e-01  9.226e-03  7.185e-18  4.880e-16  6.099e-07
   4  9.052e-11  1.469e-06  5.055e-09  1.341e-05  1.594e-01  4.963e-04  1.437e-17  5.830e-16  2.915e-07
   5  3.625e-11  1.463e-06  1.687e-11  3.984e-06  1.594e-01  7.007e-06  1.796e-17  5.599e-16  4.647e-09
   6  4.051e-11  1.463e-06  2.286e-12  4.006e-06  1.594e-01  6.990e-06  2.155e-17  3.544e-16  1.409e-12
   7  4.083e-11  1.463e-06  2.707e-12  4.007e-06  1.594e-01  6.990e-06  2.066e-17  4.880e-16  5.906e-15
   8  4.086e-11  1.463e-06  2.735e-12  4.007e-06  1.594e-01  6.990e-06  2.697e-17  4.469e-16  3.573e-15
   9  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  1.841e-17  1.464e-16  1.739e-15
  10  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  2.697e-17  5.316e-16  6.855e-16
  11  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  1.078e-17  5.419e-16  1.050e-15
  12  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  2.155e-17  3.030e-16  6.855e-16
  13  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  2.155e-17  4.263e-16  1.369e-15
  14  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  2.697e-17  3.082e-16  2.195e-15
  15  4.086e-11  1.463e-06  2.737e-12  4.007e-06  1.594e-01  6.990e-06  1.437e-17  4.058e-16  1.541e-15
[/home/kaib/OPM-master-test/debug/opm-simulators/opm/autodiff/NonlinearSolver_impl.hpp:154] Failed to complete a time step within 15 iterations.
Caught Exception: [/home/kaib/OPM-master-test/debug/opm-simulators/opm/autodiff/NonlinearSolver_impl.hpp:154] Failed to complete a time step within 15 iterations.
Solver convergence failed, restarting solver with new time step (0.000015 days).

bska commented 7 years ago

Looks like the problem is back again,

Does reverting OPM/opm-core#1147 change these results?

GitPaean commented 7 years ago

Thanks for the suggestion. I was testing OPM/opm-parser#1051; I will also test OPM/opm-core#1147 soon.

This is just to get some clues for future reference; I do not think we should spend much effort on this. The simulator still shows some random/unpredictable behavior from time to time. For example, OPM/opm-simulators#1112 fixes this problem. There are many more changes related to the simulator that make some convergence problems appear/disappear from time to time.

GitPaean commented 7 years ago

It looks like reverting either OPM/opm-parser#1051 or OPM/opm-core#1147 fixes the run of Model 2, realization-9.

GitPaean commented 7 years ago

@totto82, are OPM/opm-parser#1051 and OPM/opm-core#1147 related? Do these two PRs interact in some way?

totto82 commented 7 years ago

No, they should not interact directly. The first one is necessary for the simulator to apply the scaled capillary pressures due to SWATINIT. The second fixes initial rs and rv values. I am not surprised that the first one affects the simulator, but the last one should only have a minor impact.