OPM / opm-common

Common components for OPM, in particular build system (cmake).
http://www.opm-project.org
GNU General Public License v3.0
30 stars 111 forks source link

"TRAN*" keywords has problem in running in parallel #2282

Closed GitPaean closed 3 years ago

GitPaean commented 3 years ago

test_case.tar.gz

Above is the test case to reproduce the problem. It has the following setup

109 
110  EQUALS
111 'TRANX' 0.0   19 19  1 2 10 10 /
112 'TRANY' 0.0   20 20  1 2 10 10 /
113 'TRANZ' 0.0   20 20  1 2  9  9 /
114 /
115 

It can reproduce the problem with

mpirun -np 4

It looks like it enters the following function twice,

void apply_tran(const std::unordered_map<std::string, Fieldprops::TranCalculator>& tran,

The first time is okay, it has all the following operations

 id TRANZ0 double_data.size() 9
 id NTG double_data.size() 9
 id TRANY0 double_data.size() 9
 id TRANX0 double_data.size() 9
 id PERMY double_data.size() 9
 id PERMX double_data.size() 9
 id PERMZ double_data.size() 9
 id PORO double_data.size() 9

While the second time, it only has five operations

 id NTG double_data.size() 5
 id PERMX double_data.size() 5
 id PERMY double_data.size() 5
 id PERMZ double_data.size() 5
 id PORO double_data.size() 5

Did not dig in much more due to limited time.

blattms commented 3 years ago

Please be more precise. I guess with has problem you mean the following?

Number of saturation regions: 1

Program threw an exception: Could not initialize the problem: map::at (on rank 0)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63205,1],2]
  Exit code:    1
GitPaean commented 3 years ago

Yes. Sorry. I should have been more precise.

Program threw an exception: Could not initialize the problem: map::at (on rank 0)
blattms commented 3 years ago

@joakim-hove might this be due to refactorings in opm-common? I am pretty sure that previously eclState.globalFieldProps().keys<double> called here did list also the keys needed by the TranCalculator. Now it does not anymore.

We need to have a way to get all keys, or change accordingly.

joakim-hove commented 3 years ago

might this be due to refactorings in opm-common?

The current ongoing opm-common refactoring should not be related to this - but you never know of course. I am a bit stretched today, but I can try to look into this tomorrow.

blattms commented 3 years ago

The problem seems to be that FieldProps::keys<double>() skips the data needed by the calculator as it is invalid. In our case FieldData::value_status is a vector filled with Opm::value::status::uninitialized, and hence FieldData::valid() return false here.

Therefore none of the field datas needed by the calculator are broadcasted in PropsCentroidDatahandle.

Somehow we need to get the keys of the valid FieldData and the FieldData needed by the TransCalculator.

blattms commented 3 years ago

OPM/opm-simulators#3056 should fix this.

Please use your test case to create an automated test. Thanks a lot.

GitPaean commented 3 years ago

OPM/opm-simulators#3056 should fix this.

Please use your test case to create an automated test. Thanks a lot.

It works well. Thanks for the fix.

To create an automated test, I need to make the case to the opm-tests first. I can do that.

But your fix can go in anytime when it is ready, since jenkins is fine with it.

blattms commented 3 years ago

Can you please check current master. Another fix #2292 might have been merged without heads up here.

GitPaean commented 3 years ago

Can you please check current master. Another fix #2292 might have been merged without heads up here.

The master branch does not work with the test case attached above.

Processing dynamic information from
/home/kaib/OPM-devel5/opm/opm-tests/aquifer-num/test_case/3D_2AQU_NUM_NO_AQU.DATA line 207
Initializing report step 1/7 at 2018-11-01 0.0 DAYS line 207
Processing keyword TUNING at line 210
Processing keyword RPTRST at line 215
Processing keyword WELSPECS at line 219
Processing keyword COMPDAT at line 223
Processing keyword WCONHIST at line 227
Complete report step 1 (30.0 DAYS) at 2018-12-01 (30.0 DAYS)

Initializing report step 2/7 at 2018-12-01 (30.0 DAYS) - line 231
Complete report step 2 (31.0 DAYS) at 2019-01-01 (61.0 DAYS)

Initializing report step 3/7 at 2019-01-01 (61.0 DAYS) - line 231
Processing keyword TUNING at line 236
Complete report step 3 (59.0 DAYS) at 2019-03-01 (120.0 DAYS)

Initializing report step 4/7 at 2019-03-01 (120.0 DAYS) - line 241
Complete report step 4 (31.0 DAYS) at 2019-04-01 (151.0 DAYS)

Initializing report step 5/7 at 2019-04-01 (151.0 DAYS) - line 241
Complete report step 5 (91.0 DAYS) at 2019-07-01 (242.0 DAYS)
Report limit reached, see PRT-file for remaining Schedule initialization.

[kaib-dell-ws:1327193] *** An error occurred in MPI_Allreduce
[kaib-dell-ws:1327193] *** reported by process [2683305985,0]
[kaib-dell-ws:1327193] *** on communicator MPI_COMM_WORLD
[kaib-dell-ws:1327193] *** MPI_ERR_TRUNCATE: message truncated
[kaib-dell-ws:1327193] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kaib-dell-ws:1327193] ***    and potentially your MPI job)
joakim-hove commented 3 years ago

The master branch does not work with the test case attached above.

OK - I am guilty of the #2292 "fix" - I'll look into this.

blattms commented 3 years ago

I know what is happening (I have been there). You still cannot access the fieldData, because an exception will be thrown as it is invalid. BTW there are a lot of checks (validity is indeed checked twice!)

blattms commented 3 years ago

Just use the part of my fix for get_double (with proper credit!).

joakim-hove commented 3 years ago

Just use the part of my fix for get_double (with proper credit!).

Sorry; did not see that. Should have done as you suggested.

blattms commented 3 years ago

I am closing this now manually as #2299 got merged. Please consider referencing bug reports in the future, as this saves time.