GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1

Loose convergence when running integrated tests #628

Closed: TotoGaz closed this issue 3 years ago

TotoGaz commented 4 years ago

We built GEOSX with gcc 7.3.0 and OpenMPI 2.1.5, and were able to run the integrated tests on a Slurm cluster.

The following tests were OK:

PASSED : 63 ( 10x10x10_LaplaceFEM_01 10x10x10_LaplaceFEM_08 10x10x10_LaplaceFEM_27 50x10x5_LaplaceFEM_01 50x10x5_LaplaceFEM_08 50x10x5_LaplaceFEM_18 DryFrac_StaticPenny_PrismElem_08 deadoil_3ph_corey_1d_03 deadoil_3ph_baker_1d_01 deadoil_3ph_baker_1d_02 deadoil_3ph_baker_1d_03 deadoil_3ph_staircase_3d_01 deadoil_3ph_staircase_3d_08 compositional_multiphase_wells_1d_01 compositional_multiphase_wells_1d_02 compositional_multiphase_wells_2d_01 compositional_multiphase_wells_2d_04 dead_oil_wells_2d_01 dead_oil_wells_2d_04 staircase_compositional_multiphase_wells_3d_01 staircase_compositional_multiphase_wells_3d_08 sedov_1 sedov_8 sedov_27 sourceFlux_1d_01 sourceFlux_1d_02 sourceFlux_1d_03 compressible_1d_01 compressible_1d_02 compressible_1d_03 incompressible_1d_01 incompressible_1d_02 incompressible_1d_03 sourceFlux_2d_01 sourceFlux_2d_04 sourceFlux_2d_09 fractureFlow_2d_01 fractureFlow_2d_02 fractureFlow_2d_04 fractureJunctionFlow_2d_01 fractureMatrixFlow_2d_01 fractureMatrixFlow_2d_04 fractureMatrixFlow_2d_09 staircase_3d_01 staircase_3d_08 staircase_3d_27 compressible_single_phase_wells_1d_01 compressible_single_phase_wells_1d_02 incompressible_single_phase_wells_2d_01 incompressible_single_phase_wells_2d_04 staircase_single_phase_wells_3d_01 staircase_single_phase_wells_3d_08 SSLE-sedov_01 SSLE-sedov_08 SSLE-sedov_27 4comp_2ph_1d_01 4comp_2ph_1d_02 4comp_2ph_1d_03 4comp_2ph_cap_1d_01 4comp_2ph_cap_1d_02 4comp_2ph_cap_1d_03 deadoil_3ph_corey_1d_01 deadoil_3ph_corey_1d_02 )

But the following ones were not:

     Status     :  TestCase                                    :  Elapsed :  Resources :  TestStep     :  OutFile                                                                                                 
     ---------- :  ------------------------------------------- :  ------- :  --------- :  ------------ :  --------------------------------------------------------------------------------------------------------
     FAIL CHECK :  Hydrofracture_KGD_NodeBased_C3D6_01         :  0:00:15 :  0:00:15   :  restartcheck :  HydrofracturingSolver/Hydrofracture_KGD_NodeBased_C3D6_01/Hydrofracture_KGD_NodeBased_C3D6_01.data      
                :                                              :          :            :               :  HydrofracturingSolver/Hydrofracture_KGD_NodeBased_C3D6_01/Hydrofracture_KGD_NodeBased_C3D6_01.err       
     FAIL CHECK :  Hydrofracture_KGD_NodeBased_C3D6_09         :  0:00:26 :  0:04:02   :  restartcheck :  HydrofracturingSolver/Hydrofracture_KGD_NodeBased_C3D6_09/Hydrofracture_KGD_NodeBased_C3D6_09.data      
                :                                              :          :            :               :  HydrofracturingSolver/Hydrofracture_KGD_NodeBased_C3D6_09/Hydrofracture_KGD_NodeBased_C3D6_09.err       
     FAIL CHECK :  SurfaceGenerator_01                         :  0:00:07 :  0:00:07   :  restartcheck :  SurfaceGenerator/SurfaceGenerator_01/SurfaceGenerator_01.data                                           
                :                                              :          :            :               :  SurfaceGenerator/SurfaceGenerator_01/SurfaceGenerator_01.err                                            
     FAIL CHECK :  SurfaceGenerator_08                         :  0:00:12 :  0:01:41   :  restartcheck :  SurfaceGenerator/SurfaceGenerator_08/SurfaceGenerator_08.data                                           
                :                                              :          :            :               :  SurfaceGenerator/SurfaceGenerator_08/SurfaceGenerator_08.err                                            
     FAIL CHECK :  DryFrac_StaticPenny_PrismElem_01            :  0:00:18 :  0:00:18   :  restartcheck :  SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_01.data                 
                :                                              :          :            :               :  SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_01.err                  
     FAIL CHECK :  DryFrac_ThreeNodesPinched_HorizontalFrac_01 :  0:00:06 :  0:00:06   :  restartcheck :  SurfaceGenerator/DryFrac_ThreeNodesPinched_HorizontalFrac_01/DryFrac_ThreeNodesPinched_HorizontalFrac_01.data
                :                                              :          :            :               :  SurfaceGenerator/DryFrac_ThreeNodesPinched_HorizontalFrac_01/DryFrac_ThreeNodesPinched_HorizontalFrac_01.err
     FAIL CHECK :  DryFrac_ThreeNodesPinched_HorizontalFrac_08 :  0:00:12 :  0:01:39   :  restartcheck :  SurfaceGenerator/DryFrac_ThreeNodesPinched_HorizontalFrac_08/DryFrac_ThreeNodesPinched_HorizontalFrac_08.data
                :                                              :          :            :               :  SurfaceGenerator/DryFrac_ThreeNodesPinched_HorizontalFrac_08/DryFrac_ThreeNodesPinched_HorizontalFrac_08.err
     FAIL CHECK :  poroElasticCoupling_01                      :  0:00:07 :  0:00:07   :  restartcheck :  poroElasticCoupling/poroElasticCoupling_01/poroElasticCoupling_01.data                                  
                :                                              :          :            :               :  poroElasticCoupling/poroElasticCoupling_01/poroElasticCoupling_01.err                                   
     FAIL CHECK :  poroElasticCoupling_02                      :  0:00:09 :  0:00:19   :  restartcheck :  poroElasticCoupling/poroElasticCoupling_02/poroElasticCoupling_02.data                                  
                :                                              :          :            :               :  poroElasticCoupling/poroElasticCoupling_02/poroElasticCoupling_02.err                                   
     FAIL CHECK :  poroElasticCoupling_07                      :  0:00:11 :  0:01:17   :  restartcheck :  poroElasticCoupling/poroElasticCoupling_07/poroElasticCoupling_07.data                                  
                :                                              :          :            :               :  poroElasticCoupling/poroElasticCoupling_07/poroElasticCoupling_07.err                                   
     FAIL CHECK :  beamBending_01                              :  0:00:10 :  0:00:10   :  restartcheck :  solidMechanicsSSLE/beamBending_01/beamBending_01.data                                                   
                :                                              :          :            :               :  solidMechanicsSSLE/beamBending_01/beamBending_01.err                                                    
     FAIL CHECK :  beamBending_08                              :  0:00:10 :  0:01:26   :  restartcheck :  solidMechanicsSSLE/beamBending_08/beamBending_08.data                                                   
                :                                              :          :            :               :  solidMechanicsSSLE/beamBending_08/beamBending_08.err                                                    
     FAIL CHECK :  beamBending_27                              :  0:00:18 :  0:08:09   :  restartcheck :  solidMechanicsSSLE/beamBending_27/beamBending_27.data                                                   
                :                                              :          :            :               :  solidMechanicsSSLE/beamBending_27/beamBending_27.err                                                    
     FAIL CHECK :  DryFrac_ThreeNodesPinched_SlantFrac_01      :  0:00:06 :  0:00:06   :  restartcheck :  SurfaceGenerator/DryFrac_ThreeNodesPinched_SlantFrac_01/DryFrac_ThreeNodesPinched_SlantFrac_01.data     
                :                                              :          :            :               :  SurfaceGenerator/DryFrac_ThreeNodesPinched_SlantFrac_01/DryFrac_ThreeNodesPinched_SlantFrac_01.err      
     FAIL CHECK :  DryFrac_ThreeNodesPinched_SlantFrac_08      :  0:00:12 :  0:01:40   :  restartcheck :  SurfaceGenerator/DryFrac_ThreeNodesPinched_SlantFrac_08/DryFrac_ThreeNodesPinched_SlantFrac_08.data     
                :                                              :          :            :               :  SurfaceGenerator/DryFrac_ThreeNodesPinched_SlantFrac_08/DryFrac_ThreeNodesPinched_SlantFrac_08.err      
     ---------- :  ------------------------------------------- :  ------- :  --------- :  ------------ :  --------------------------------------------------------------------------------------------------------

For most of them (the exceptions are SurfaceGenerator/SurfaceGenerator_08, SurfaceGenerator/DryFrac_ThreeNodesPinched_HorizontalFrac_08, and SurfaceGenerator/DryFrac_ThreeNodesPinched_SlantFrac_08; see https://github.com/GEOSX/GEOSX/issues/629), it appears that the tolerance required by the restart check is too tight.

Here is an example:

> srun -n 1 /home/j0436735/.local/virtualenvs/integrated-tests/bin/python -m mpi4py /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/geosxats/helpers/restartcheck.py -a 1e-08 -r 2e-10 -w /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_[0-9]+\.root /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/baselines/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_[0-9]+\.root
Comparison of file /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002.root from pattern /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_[0-9]+.root
Baseline file /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/baselines/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002.root from pattern /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/baselines/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_[0-9]+.root
Relative tolerance: 2e-10
Absolute tolerance: 1e-08
Output file: /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002.restartcheck
Excluded groups: ['.*/commandLine', '.*/schema$', '.*/globalToLocalMap']
Warnings are errors: True

The root files are similar.
/lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/geosxats/helpers/restartcheck.py:227: RuntimeWarning: invalid value encountered in divide
  relative_difference = difference / abs_base_arr
/lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/geosxats/helpers/restartcheck.py:227: RuntimeWarning: divide by zero encountered in divide
  relative_difference = difference / abs_base_arr
/lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/geosxats/helpers/restartcheck.py:237: RuntimeWarning: overflow encountered in divide
  relative_difference /= self.rtol

Rank 0 is comparing /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002/DryFrac_StaticPenny_PrismElem_restart_000000002_0000000.hdf5 with /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/baselines/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002/DryFrac_StaticPenny_PrismElem_restart_000000002_0000000.hdf5 
********************************************************************************
Error: /datagroup_0000000/sidre/external/Problem/domain/MeshBodies/mesh1/Level0/ElementRegions/elementRegionsGroup/Region2/elementSubRegions/cb1/granite/DeviatorStress
        Arrays of types float64 and float64 have 497664 values of which 3 fail both the relative and absolute tests.
                Max absolute difference is at index (184683,): value = -1644.5834970017895, base_value = -1644.5834970180877
                Max relative difference is at index (289,): value = 1.1368683772161603e-13, base_value = 0.0
        Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 3
                max = 1.3969838619232178, mean = 1.1796752611796062, std = 0.1582943993804599
        Statistics of the q values greater than 1.0 defined by relative tolerance: N = 0
********************************************************************************
********************************************************************************
Error: /datagroup_0000000/sidre/external/Problem/domain/MeshBodies/mesh1/Level0/ElementRegions/elementRegionsGroup/Region2/elementSubRegions/cb1/DeviatorStress
        Arrays of types float64 and float64 have 497664 values of which 3 fail both the relative and absolute tests.
                Max absolute difference is at index (184683,): value = -1644.5834970017895, base_value = -1644.5834970180877
                Max relative difference is at index (289,): value = 1.1368683772161603e-13, base_value = 0.0
        Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 3
                max = 1.3969838619232178, mean = 1.1796752611796062, std = 0.1582943993804599
        Statistics of the q values greater than 1.0 defined by relative tolerance: N = 0
********************************************************************************
The files are different.

Compared 1 pairs of files of which 1 are different.
        /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002/DryFrac_StaticPenny_PrismElem_restart_000000002_0000000.hdf5 and /lustre4/scratch/j0436735/src-gnu/GEOSX/integratedTests/update/run/SurfaceGenerator/baselines/DryFrac_StaticPenny_PrismElem_01/DryFrac_StaticPenny_PrismElem_restart_000000002/DryFrac_StaticPenny_PrismElem_restart_000000002_0000000.hdf5
Rank 0 [Tue Oct 29 16:03:40 2019] [c0-0c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: error: nid00017: task 0: Aborted
srun: Terminating job step 631408.0

If the absolute tolerance is relaxed from 1e-08 to 1e-07, the comparison passes. Apart from the 3 cases mentioned above, all the failing tests follow this same pattern.
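If I read the restartcheck.py lines quoted in the warnings above correctly, the pass/fail rule can be sketched as follows (a value fails only when it exceeds both tolerances; the function and variable names here are my own, and the exact GEOSX implementation may differ):

```python
import numpy as np

# Sketch of the check, inferred from the restartcheck.py lines quoted above;
# a value fails only if it exceeds BOTH tolerances (q > 1 on both scales).
def q_values(arr, base, atol, rtol):
    difference = np.abs(arr - base)
    q_abs = difference / atol                     # absolute-tolerance scale
    with np.errstate(divide="ignore", invalid="ignore"):
        q_rel = difference / np.abs(base) / rtol  # relative-tolerance scale
    return q_abs, q_rel

# The max absolute difference reported in the log above:
value = np.array([-1644.5834970017895])
base = np.array([-1644.5834970180877])

q_abs, q_rel = q_values(value, base, atol=1e-8, rtol=2e-10)
print(q_abs[0] > 1.0)   # True: exceeds the 1e-8 absolute tolerance
print(q_rel[0] > 1.0)   # False: well inside the relative tolerance

# Relaxing atol to 1e-7 makes the absolute check pass as well:
q_abs7, _ = q_values(value, base, atol=1e-7, rtol=2e-10)
print(q_abs7[0] > 1.0)  # False
```

So this particular value passes overall (it fails only the absolute test); the 3 values that do fail both tests are presumably those where base_value is 0, so the relative difference blows up while the absolute difference still exceeds 1e-8.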

What is your point of view on this?

Thanks,

PS: Are you aware of the Python RuntimeWarning above?

WuHuiLLNL commented 4 years ago

I think this is fine. As you said, it is caused by an overly tight absolute tolerance.

TotoGaz commented 4 years ago

I agree with you @WuHuiLLNL; I am not too worried about this. Opening this issue was also a way to report and track it.

corbett5 commented 4 years ago

The Python warning can be ignored; it is handled in the script. As for the tolerances, those look like the same tests that are failing on our GPU system (even when run CPU-only). On our system some of the diffs are pretty inconsequential, like the one you posted above, but others are huge. I believe @rrsettgast is looking into it.
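For reference, the warnings come from element-wise divisions where the baseline array contains zeros: 0/0 produces nan ("invalid value") and x/0 produces inf ("divide by zero"), and the script works with the resulting values afterwards. A minimal reproduction (illustrative values only, not GEOSX code):

```python
import numpy as np

# 0/0 -> nan ("invalid value encountered in divide") and x/0 -> inf
# ("divide by zero encountered in divide") are exactly the warnings seen
# in the log above; np.errstate silences them for this block only.
base = np.array([2.0, 0.0, 0.0])
diff = np.array([1.0, 0.0, 3.0])

with np.errstate(divide="ignore", invalid="ignore"):
    relative = diff / base

print(relative)  # [0.5 nan inf]
# The nan/inf entries remain available to the tolerance logic afterwards:
print(np.isnan(relative[1]), np.isinf(relative[2]))  # True True
```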

AntoineMazuyer commented 4 years ago

The problem is still here, for instance for beamBending:

********************************************************************************
Error: /Problem/domain/MeshBodies/mesh1/Level0/ElementRegions/elementRegionsGroup/Region2/elementSubRegions/cb1/shale/stress
        Arrays of types float64 and float64 have 122880 values of which 48166 fail both the relative and absolute tests.
                Max absolute difference is at index (28, 2, 0): value = -604471801.6809957, base_value = -604471801.6808847
                Max relative difference is at index (120, 2, 3): value = -1.862645149230957e-08, base_value = 0.0
        Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 40043
                max = 10490.41748046875, mean = 649.5192945775137, std = 1021.0417646132936
        Statistics of the q values greater than 1.0 defined by relative tolerance: N = 8123
                max = 1964.881363738629, mean = 8.174558951984588, std = 59.06163032902932
********************************************************************************
********************************************************************************
Error: /Problem/domain/MeshBodies/mesh1/Level0/ElementRegions/elementRegionsGroup/Region2/elementSubRegions/cb1/stress
        Arrays of types float64 and float64 have 122880 values of which 48166 fail both the relative and absolute tests.
                Max absolute difference is at index (28, 2, 0): value = -604471801.6809957, base_value = -604471801.6808847
                Max relative difference is at index (120, 2, 3): value = -1.862645149230957e-08, base_value = 0.0
        Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 40043
                max = 10490.41748046875, mean = 649.5192945775137, std = 1021.0417646132936
        Statistics of the q values greater than 1.0 defined by relative tolerance: N = 8123
                max = 1964.881363738629, mean = 8.174558951984588, std = 59.06163032902932
********************************************************************************
AntoineMazuyer commented 4 years ago

I am working on this issue, but some differences on develop are not negligible, e.g. for compositional_multiphase_wells_1d_01:

 Error: /Problem/domain/MeshBodies/mesh1/Level0/ElementRegions/elementRegionsGroup/Region1/elementSubRegions/cb1/dPhaseDensity_dGlobalCompFraction
         Arrays of types float64 and float64 have 40 values of which 1 fail both the relative and absolute tests.
                 Max absolute difference is at index (2, 0, 0, 3): value = 3360.6991367299497, base_value = 3360.33293114939
                 Max relative difference is at index (2, 0, 1, 0): value = -0.0012330285399880093, base_value = -0.0018495428099812644
         Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 0
         Statistics of the q values greater than 1.0 defined by relative tolerance: N = 1
                 max = 2.3148148232806087, mean = 2.3148148232806087, std = 0.0

It seems to happen for all the dPhaseDensity_dGlobalCompFraction arrays, in other tests as well.

I don't want to have a tolerance of 1e-1.

Thoughts?

corbett5 commented 4 years ago

@AntoineMazuyer do you get a diff even after rebasing and running the tests again?

AntoineMazuyer commented 4 years ago

@corbett5 I don't want to rebase it, because that would just push the problem onto the next change.

AntoineMazuyer commented 4 years ago

By the way, I noticed this for this specific test:

     Attempt: 0, Newton: 0, R = 7.53007e-05
     WARNING: Rachford-Rice Newton reached max number of iterations
     WARNING: Rachford-Rice Newton reached max number of iterations
     WARNING: Rachford-Rice Newton reached max number of iterations
     WARNING: Rachford-Rice Newton reached max number of iterations
     WARNING: Rachford-Rice Newton reached max number of iterations
     WARNING: Rachford-Rice Newton reached max number of iterations
     WARNING: Rachford-Rice Newton reached max number of iterations
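For context on the warning: the Rachford-Rice equation determines the vapor fraction V of a two-phase flash from the feed composition z and the K-values, and a plain Newton solve can hit its iteration cap when the initial guess is poor or the K-values are extreme. An illustrative sketch (not the GEOSX implementation; z and K below are made-up values):

```python
import numpy as np

def rachford_rice(z, K, max_iter=50, tol=1e-12):
    """Solve sum_i z_i (K_i - 1) / (1 + V (K_i - 1)) = 0 for the vapor
    fraction V with Newton's method. Illustrative only, not GEOSX code."""
    Km1 = K - 1.0
    V = 0.5  # start inside the physical interval [0, 1]
    for _ in range(max_iter):
        denom = 1.0 + V * Km1
        f = np.sum(z * Km1 / denom)
        df = -np.sum(z * Km1**2 / denom**2)  # always negative: f is monotone
        step = f / df
        V -= step
        if abs(step) < tol:
            return V, True
    # Mirrors "Rachford-Rice Newton reached max number of iterations":
    return V, False

z = np.array([0.5, 0.3, 0.2])   # feed mole fractions (made up)
K = np.array([2.0, 0.8, 0.5])   # K-values (made up)
V, converged = rachford_rice(z, K)
```

For well-behaved K-values like these the iteration converges in a handful of steps; the warning in the log suggests cases where it does not, which may be related to the diffs in the derivative arrays.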
joshua-white commented 4 years ago

Possible overlap with #840 and #766