OPM / opm-simulators

OPM Flow and experimental simulators, including components such as well models etc.
http://www.opm-project.org
GNU General Public License v3.0

Parallel results show an obvious difference when using PORV to modify the pore volume directly #3081

Closed GitPaean closed 3 years ago

GitPaean commented 3 years ago

This is part of the effort to investigate why the parallel results are different with PR https://github.com/OPM/opm-simulators/pull/3039 . (https://github.com/OPM/opm-common/issues/2282 is also due to the same reason.)

The test case used is attached here: test_porv_parallel.tar.gz

Without the following keywords

EQUALS
 'PORV' 0.25000000E+10   20 20  1 1  10 10 /
 'PORV' 0.26400003E+10   20 20  2 2  10 10 /
/

The serial results and the results with 4 MPI processes are the same. ftaq-0104.pdf

But with the above setup, the serial results and parallel results are different. ftaq-0104.pdf

The running parameter is --enable-tuning=true.

blattms commented 3 years ago

Actually, if you access PORV via the porv() method in the (Parallel)FieldPropsManager, the values should be broadcast from rank 0. As long as the EQUALS operation is applied directly in the parser (@joakim-hove should know), the values should be correct.

blattms commented 3 years ago

Are we maybe summing the pore volumes of a few cells without excluding the ghost cells, and by coincidence the PORV of the ghosts is zero without EQUALS?
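
For illustration, a ghost-aware sum of that kind could look roughly like the sketch below. This is only a sketch under assumed names (a Dune grid view, an element mapper, and a porv vector indexed by local element index); it is not the actual flow code.

#include <dune/grid/common/gridenums.hh>
#include <dune/grid/common/rangegenerators.hh>
#include <vector>

// Sum pore volumes over interior cells only, so that ghost/overlap copies
// owned by another rank are not counted twice.
template <class GridView, class ElementMapper>
double interiorPorvSum(const GridView& gridView,
                       const ElementMapper& elementMapper,
                       const std::vector<double>& porv)
{
    double localSum = 0.0;
    for (const auto& elem : elements(gridView)) {
        if (elem.partitionType() != Dune::InteriorEntity)
            continue; // skip ghost/overlap cells
        localSum += porv[elementMapper.index(elem)];
    }
    // Combine the per-rank partial sums over all processes.
    return gridView.comm().sum(localSum);
}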

joakim-hove commented 3 years ago

Will investigate ...

joakim-hove commented 3 years ago

I have added a testcase for this: https://github.com/OPM/opm-tests/pull/489 and will continue the investigation.

blattms commented 3 years ago

I did take another look at ParallelFieldPropsManager::porv() and have to conclude that it will only work if called with the argument true (to get a global field properties vector). When passing false or no argument, the resulting array will be bigger than the number of local cells: it will be an array with the condensed field properties of the undistributed grid. That is not usable.

If called with false or no argument, we should still use true in the underlying call on the root process and use the CartesianIndexMapper to map the broadcast result to a local array. Not very pretty, but it should work.
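
A minimal sketch of that mapping step, assuming the broadcast porv(true) vector is indexed by the global Cartesian index and that a cartesianIndex(localCell) method (as provided by the CartesianIndexMapper) is available; this is not the actual fix, just the idea:

#include <cstddef>
#include <vector>

// Build a rank-local PORV array from the globally broadcast one.
// porvGlobal is assumed to be indexed by global Cartesian index;
// cartMapper.cartesianIndex(cell) maps a local compressed cell index to it.
template <class CartesianMapper>
std::vector<double> localPorv(const std::vector<double>& porvGlobal,
                              const CartesianMapper& cartMapper,
                              int numLocalCells)
{
    std::vector<double> porvLocal(numLocalCells, 0.0);
    for (int cell = 0; cell < numLocalCells; ++cell) {
        porvLocal[cell] = porvGlobal[cartMapper.cartesianIndex(cell)];
    }
    return porvLocal;
}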

@GitPaean how are you calling this function?

joakim-hove commented 3 years ago

@GitPaean how are you calling this function?

@GitPaean should of course answer himself; but my understanding is that he is "just running flow". I think you are correct with your thoughts about global / not-global properties; the PORV record is special - it must store and manage all cells, including the inactive ones. I have just merged the testcase: OPM/opm-tests#489 but have not yet looked at the problem.

GitPaean commented 3 years ago

@GitPaean should of course answer himself; but my understanding is that he is "just running flow".

Yes. The results are from flow with the master branch of OPM.

blattms commented 3 years ago

~Maybe I should rephrase my question (with you, I meant your aquifer code):~

~- how does the Aquifer code use porv, change it or the grid? Aren't there some pseudo cells introduced?~

Or simply reread the description.

What I am wondering: What is special about the model? Do you need to activate tuning to see the problem?

GitPaean commented 3 years ago

What I am wondering: What is special about the model? Do you need to activate tuning to see the problem?

Very good question. I tested a little more, and it is not only the PORV that causes the problem. The two cells with modified PORV are connected to many NNCs. Maybe it is a combination of PORV and NNC, or maybe when handling NNCs there is some inconsistency related to volumes or pore volumes?

I am busy at the moment for family reasons. If there is still no clue regarding the issue after a few days (by the coming Thursday), I can help to debug or provide more information. The numerical aquifer itself is simple in theory, but it touches so many facilities deep in OPM flow that it creates challenges. Hopefully, it also helps to fix some potential problems.

joakim-hove commented 3 years ago

OK - I have investigated this further. My analysis indicates that PORV is distributed correctly, also with the EQUALS modifier:

EQUALS
 'PORV' 0.25000000E+10   20 20  1 1  10 10 /
 'PORV' 0.26400003E+10   20 20  2 2  10 10 /
/

However, for this particular model the pore volume of the other cells is ~12500, i.e. the two cells singled out for special treatment are a factor of ~2 * 10^5 larger than the remaining cells. I have also tried to write the modifier as:

MULTIPLY
   'PORV'  200000 20 20 1 1 10 10 /
   'PORV'  250000 20 20 2 2 10 10 /
/

which fails in the same way as the original deck. If the scaling factor is turned down, the test passes. So all in all my suggestion is that the problem becomes ill-conditioned, and that the parallel and serial simulations respond differently to the bad conditioning?
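
As a quick sanity check on the magnitudes, using the approximate values quoted above, the EQUALS value and the first MULTIPLY factor are consistent:

$$
\frac{0.25 \times 10^{10}}{1.25 \times 10^{4}} = 2 \times 10^{5} = 200000
$$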

@blattms, @totto82, @atgeirr

Testcase: https://github.com/OPM/opm-tests/blob/master/parallel_fieldprops/3D_PORV_OPERATOR.DATA
Activate testcase PR: https://github.com/OPM/opm-simulators/pull/3086

GitPaean commented 3 years ago

So all in all my suggestion is that the problem becomes ill-conditioned, and that the parallel and serial simulations respond differently to the bad conditioning?

Then we have to accept that for numerical aquifers, the parallel results are different from the serial results, which is a little tough to accept.

I will also do more testing regarding this and will keep you updated.

joakim-hove commented 3 years ago

Then we have to accept that for numerical aquifers, the parallel results are different from the serial results, which is a little tough to accept.

Well; I am not arguing the case very strongly - but so far this is my conclusion (guess!) - and I strongly encourage someone more into the linear algebra to look into this.

blattms commented 3 years ago

So is there a problem without numerical aquifers? If not, then maybe the problem is in the parallel aquifer code. Unfortunately, I have no idea how that works.

BTW: What led you to think that the porv values are wrong? Did you test them? Maybe the place where you noticed it gives a hint.

GitPaean commented 3 years ago

So is there a problem without numerical aquifers? If not, then maybe the problem is in the parallel aquifer code. Unfortunately, I have no idea how that works.

Here I am talking about the master branch. We use different keywords to mimic the behavior of numerical aquifers in order to figure out why the parallel results are different with numerical aquifers in PR #3039.

BTW: What led you to think that the porv values are wrong? Did you test them? Maybe the place where you noticed it gives a hint.

I did not say the pore volumes are wrong; it is just that the deck with the PORV specification (the case attached above) makes the parallel results differ, i.e. it reproduces the problem in PR #3039.

For sure @joakim-hove's information is very important, and the theory may well be the truth; hopefully we can do some fix regarding this. I am investigating something similar/relevant related to the pore volumes used in the simulator code (not related to parallel runs, but it also changes the results slightly); maybe not immediately, but I will have some updates regarding this issue soon. Because of that observation, @joakim-hove's statement sounds very sensible. But different results when running in parallel is in itself a problem we need to tackle.

GitPaean commented 3 years ago

A small update: with the following running parameters

--enable-tuning=true --tolerance-cnv=0.001 --max-strict-iter=20 --tolerance-mb=1.e-7 --tolerance-wells=0.00001

The results with 1 proc and 4 procs agree very well. So @joakim-hove's explanation related to a numerical cause is probably correct.

ftaq-0104.pdf

No time step uses more than 4 Newton iterations, so --max-strict-iter=20 should not matter in theory; however, it is required to obtain results that agree well. No clue why yet. I need to check the usage of --max-strict-iter=20.

alfbr commented 3 years ago

No time step uses more than 4 Newton iterations, so --max-strict-iter=20 should not matter in theory; however, it is required to obtain results that agree well. No clue why yet. I need to check the usage of --max-strict-iter=20.

From the discussion it sounds like the pore volumes in the test case are on the extreme side. Consider testing with --relaxed-max-pv-fraction=0

GitPaean commented 3 years ago

--relaxed-max-pv-fraction=0 also does the trick.

From the discussion it sounds like the pore volumes in the test case are on the extreme side.

Yes. But it is a numerical aquifer case we have, and since one or a few cells represent the whole aquifer, the pore volume needs to be huge unless the aquifer is extremely small.

GitPaean commented 3 years ago

template<class TypeTag>
struct MaxStrictIter<TypeTag, TTag::FlowModelParameters> {
    static constexpr int value = 0;
};

By default, it is 0 now. Is that desired?

GitPaean commented 3 years ago

The reason is that the relaxed tolerance (by default 1, which is very relaxed) is used based on the following statement:

 const bool use_relaxed = cnvErrorPvFraction < param_.relaxed_max_pv_fraction_ && iteration >= param_.max_strict_iter_;  

By default, param_.max_strict_iter_ is zero.
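
To make the interaction of the two parameters concrete, here is a minimal standalone sketch of that decision; the names mirror the snippet above and the default values are the ones discussed in this thread, but the struct itself is just for illustration:

#include <iostream>

// Defaults discussed in this thread: relaxed_max_pv_fraction = 0.03,
// max_strict_iter = 0 (illustrative struct, not the actual flow parameters class).
struct ConvergenceParams {
    double relaxed_max_pv_fraction = 0.03;
    int    max_strict_iter = 0;
};

// Same decision as the quoted statement above.
bool useRelaxedTolerance(double cnvErrorPvFraction, int iteration,
                         const ConvergenceParams& p)
{
    // With max_strict_iter == 0 the iteration test is always satisfied,
    // so the decision reduces to cnvErrorPvFraction < relaxed_max_pv_fraction.
    return cnvErrorPvFraction < p.relaxed_max_pv_fraction
        && iteration >= p.max_strict_iter;
}

int main()
{
    ConvergenceParams p;
    // 1% of the pore volume still violates the strict CNV limit:
    // the relaxed tolerance is applied even on the first Newton iteration.
    std::cout << useRelaxedTolerance(0.01, 0, p) << '\n'; // prints 1
    // --relaxed-max-pv-fraction=0 disables the relaxation entirely.
    p.relaxed_max_pv_fraction = 0.0;
    std::cout << useRelaxedTolerance(0.01, 0, p) << '\n'; // prints 0
}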

alfbr commented 3 years ago

--relaxed-max-pv-fraction=0 also does the trick.

From the discussion it sounds like the pore volumes in the test case are on the extreme side.

Yes. But it is a numerical aquifer case we have, and since one or a few cells represent the whole aquifer, the pore volume needs to be huge unless the aquifer is extremely small.

No problem, just use the --relaxed-max-pv-fraction=0 for the regression testing.

By default, it is 0 now. Is that desired?

Yes, but in combination with --relaxed-max-pv-fraction=0.03 it actually means that all iterations are strict. Setting that default actually took a lot of work and testing. You will find part of the process here: https://github.com/OPM/opm-simulators/pull/2788

GitPaean commented 3 years ago

No problem, just use the --relaxed-max-pv-fraction=0 for the regression testing.

That is okay. Possibly we need to introduce numerical-aquifer-specific logic when deciding whether to use the relaxed tolerance, unless we want to have special running arguments for these cases.

alfbr commented 3 years ago

No problem, just use the --relaxed-max-pv-fraction=0 for the regression testing.

That is okay. Possibly we need to introduce numerical-aquifer-specific logic when deciding whether to use the relaxed tolerance, unless we want to have special running arguments for these cases.

That is a good point. I am not sure how much of an issue it is on realistic models; we will just have to test it thoroughly, I guess.

GitPaean commented 3 years ago

That is a good point. I am not sure how much of an issue it is on realistic models; we will just have to test it thoroughly, I guess.

It looks like it will be case dependent. For the two cases in opm-tests, it matters for 3D_2AQU_NUM.DATA but does not matter much for 3D_1AQU_3CELLS.DATA.

alfbr commented 3 years ago

That is a good point. I am not sure how much of an issue it is on realistic models; we will just have to test it thoroughly, I guess.

It looks like it will be case dependent. For the two cases in opm-tests, it matters for 3D_2AQU_NUM.DATA but does not matter much for 3D_1AQU_3CELLS.DATA.

Actually, neither model looks well suited for tuning default numerical parameters, as they are not very representative. Generally speaking, I think it is good to have test models that exaggerate effects, and usually that works fine with default tuning. This seems to be a counterexample though.

GitPaean commented 3 years ago

Thanks everyone for the help and efforts, and sorry that the initial guess was misleading (it was purely symptom based). The real reason is that with the relaxed tolerance, the results are a little uncertain. Even with different master branches, the results are a little uncertain. My original numerical aquifer code had good results with 3D_2AQU_NUM.DATA; after rebasing to a newer master, the results are not as good, both with default parameters and the same aquifer code. The problem behaves differently for different cases with the relaxed tolerance, which made the investigation even harder. For example, removing some irrelevant keywords (like some NNCs, which are not related to the problem itself but change the case behavior) might change the symptom.

Using --relaxed-max-pv-fraction=0 fixes both the parallel running problem and the problem described above.

alfbr commented 3 years ago

Thanks everyone for the help and efforts, and sorry that the initial guess was misleading (it was purely symptom based). The real reason is that with the relaxed tolerance, the results are a little uncertain. Even with different master branches, the results are a little uncertain. My original numerical aquifer code had good results with 3D_2AQU_NUM.DATA; after rebasing to a newer master, the results are not as good, both with default parameters and the same aquifer code. The problem behaves differently for different cases with the relaxed tolerance, which made the investigation even harder. For example, removing some irrelevant keywords (like some NNCs, which are not related to the problem itself but change the case behavior) might change the symptom.

Using --relaxed-max-pv-fraction=0 fixes both the parallel running problem and the problem described above.

Thanks for your patience on this one; knowing the pitfalls of using numerical aquifers with default tuning is valuable in its own right.

atgeirr commented 3 years ago

Your results indicate that the original ones were not properly converged solutions. I think our current CNV policies, accepting large errors in 3% of the pore volume by default, are a bit dangerous, and they bit us in this situation. 3% of the pore volume is probably the rest of the grid, apart from the "faquifer" (fake aquifer...) cells. Maybe we should make it a little more robust by also counting the number of bad cells, and requiring that also to be below some limit?
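
A hypothetical variant of the acceptance test along those lines, just to illustrate the idea (all names and the cell-count threshold are made up, not the current flow code):

// Accept the relaxed CNV tolerance only if BOTH the violating pore-volume
// fraction and the number of violating cells are below their limits.
bool acceptRelaxedCnv(double cnvErrorPvFraction, int numViolatingCells,
                      double maxPvFraction = 0.03, int maxViolatingCells = 100)
{
    return cnvErrorPvFraction < maxPvFraction
        && numViolatingCells < maxViolatingCells;
}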

Edit: I realize I repeated some of what was already said; for some reason the tab had not updated to include all the latest comments when I made mine, sorry! But my question is still valid, I think.

alfbr commented 3 years ago

Your results indicate that the original ones were not properly converged solutions. I think our current CNV policies, accepting large errors in 3% of the pore volume by default, are a bit dangerous, and they bit us in this situation. 3% of the pore volume is probably the rest of the grid, apart from the "faquifer" (fake aquifer...) cells. Maybe we should make it a little more robust by also counting the number of bad cells, and requiring that also to be below some limit?

That is an interesting idea. Please keep in mind, though, that there will always be corner cases where whatever criterion we choose will end up giving large errors. Hence, changing default tuning parameters based on corner cases that will never be encountered in real data sets is not a viable approach. I am not saying this is the case here, but the defaults we currently have are the result of a careful consideration of accuracy and run time across a large collection of datasets. Hence, I suggest we revisit this when we have had a chance to test it on a variety of cases involving aquifers. If you have any such cases exhibiting problems, please share your experience.

atgeirr commented 3 years ago

@GitPaean I guess this can be closed?