GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1
HF simulation resumed from a restart file has weird matrix sizes #1004

Open cssherman opened 4 years ago

cssherman commented 4 years ago

Describe the bug

While testing a large-scale hydraulic fracture model, I encountered an error after resuming from a restart file. The code appeared to resume without any errors. During the next application of the HF solver, the numerical solution appeared to be going fine and identified faces that needed to be opened. After this point, however, the solver reported the following error:

ERROR: Amesos NumericFactorization failed... dumping relevant matrix for post-mortem

To Reproduce

I found this error while working with one of the new HF examples in #990. Some key pieces of information:

I'll try to find a smaller version of this problem that shows this error, and play around with different partitioning. The following are segments of the log file and matrix dump file:

Log file after restart:

Max threads: 2
MKL max threads: 2
real64 is alias of double
localIndex is alias of long
globalIndex is alias of long long
Loading restart file /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055
Rank 0: rankFilePattern = /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart000000055/rank%07d.hdf5
Rank 0: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000000.hdf5
Rank 1: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000001.hdf5
Rank 2: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000002.hdf5
Rank 3: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000003.hdf5
Rank 4: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000004.hdf5
Rank 5: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000005.hdf5
Rank 7: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000007.hdf5
Rank 9: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000009.hdf5
Rank 10: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000010.hdf5
Rank 12: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000012.hdf5
Rank 13: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000013.hdf5
Rank 14: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000014.hdf5
Rank 15: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000015.hdf5
Rank 17: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000017.hdf5
Rank 19: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000019.hdf5
Rank 20: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000020.hdf5
Rank 6: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000006.hdf5
Rank 8: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000008.hdf5
Rank 11: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000011.hdf5
Rank 16: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000016.hdf5
Rank 18: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000018.hdf5
Rank 21: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000021.hdf5
Rank 22: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000022.hdf5
Rank 23: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000023.hdf5
Rank 24: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000024.hdf5
Rank 25: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000025.hdf5
Rank 26: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000026.hdf5
Rank 27: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000027.hdf5
Rank 28: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000028.hdf5
Rank 29: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000029.hdf5
Rank 30: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000030.hdf5
Rank 31: Reading in restart file at /p/lustre2/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/tmp_restart_000000055/rank_0000031.hdf5
GEOS must be configured to use Python to use parameters, symbolic math, etc. in input files
Adding Mesh: InternalMesh, mesh1
Adding Geometric Object: Box, perf_a
Adding Geometric Object: Box, source_a
Adding Geometric Object: ThickPlane, fracturable_a
Adding Solver of type Hydrofracture, named hydrofracture
Adding Solver of type SolidMechanicsLagrangianSSLE, named lagsolve
Adding Solver of type SinglePhaseFVM, named SinglePhaseFlow
Adding Solver of type SurfaceGenerator, named SurfaceGen
Adding Output: Silo, siloOutput
Adding Output: Restart, restartOutput
Adding Event: SoloEvent, preFracture
Adding Event: PeriodicEvent, outputs
Adding Event: PeriodicEvent, solverApplications_a
Adding Event: PeriodicEvent, solverApplications_b
Adding Event: PeriodicEvent, solverApplications_c
Adding Event: HaltEvent, restarts
TableFunction: flow_rate
TableFunction: sigma_xx
TableFunction: sigma_yy
TableFunction: sigma_zz
TableFunction: init_pressure
TableFunction: bulk_modulus
TableFunction: shear_modulus
Adding Object CellElementRegion named Region2 from ObjectManager::Catalog.
Adding Object FaceElementRegion named Fracture from ObjectManager::Catalog.
Running simulation
The restart-file was written during step 5 of the event loop. Resuming from that point.
Time: 184s, dt:4s, Cycle: 55
Time: 188s, dt:2s, Cycle: 56
    Attempt:  0, NewtonIter:  0 ;
( Rfluid, Rsolid ) = (4.84e-01, 1.99e-13) ;

    Attempt:  0, NewtonIter:  1 ;
( Rfluid, Rsolid ) = (5.40e-02, 4.00e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ;

    Attempt:  0, NewtonIter:  2 ;
( Rfluid, Rsolid ) = (2.86e-04, 4.84e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ;

    Attempt:  0, NewtonIter:  3 ;
( Rfluid, Rsolid ) = (5.47e-07, 2.09e-12) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ;
++ Fracture propagation. Re-entering Newton Solve.
    Attempt:  0, NewtonIter:  0 ;
( Rfluid, Rsolid ) = (1.30e-02, 1.37e-01) ;
ERROR: Amesos NumericFactorization failed... dumping relevant matrix for post-mortem

    Attempt:  0, NewtonIter:  1 ;
( Rfluid, Rsolid ) = (1.10e-02, 1.37e-01) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ;
ERROR: Amesos NumericFactorization failed... dumping relevant matrix for post-mortem

    Attempt:  0, NewtonIter:  2 ;
( Rfluid, Rsolid ) = (1.10e-02, 1.37e-01) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ;

Last few lines from the file amesos-failure.dat:

79 78 -2.3811047519983836e+07
80 68 -3.3696905051534642e+06
80 63 -6.6849689075721288e+06
80 64 -3.3152784024186642e+06
80 71 -1.4833076419058008e+118
80 77 -3.3696905051534586e+06
80 72 -1.4833076419058008e+118
80 80 -1.4833076419058020e+118
80 81 -1.4833076419058020e+118
81 63 -1.0152750673919858e+07
81 64 -2.0355044939432088e+07
81 69 -1.0202294265512232e+07
81 71 -1.1691518144325470e+118
81 72 -1.1691518144325470e+118
81 78 -1.0202294265512213e+07
81 80 -1.1691518144325479e+118
81 81 -1.1691518144325479e+118

andrea-franceschini commented 4 years ago

The fracture is propagating along a partition boundary

Can it be related to #1000?

joshua-white commented 4 years ago

Most likely a bad matrix is being assembled (either singular or with crazy entries) and the direct solver is failing. In amesos-failure.dat, it looks like you have some matrix entries with order 1e118, which is probably not good. Can you confirm we are getting a funny matrix before the failing solve? We may then be able to identify which elements have bad local element matrices (i.e. ones with huge entries) by doing a check during assembly. I suspect these will be the fracture tip elements.
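A quick way to confirm which entries are suspicious is to scan the dump for values with an implausible magnitude. The following is only a minimal sketch: it assumes each line of the dump holds a whitespace-separated `row col value` triple (as in the excerpt above), and the function name `find_suspect_entries` and the `1e50` threshold are illustrative choices, not part of GEOSX.

```python
import math


def find_suspect_entries(lines, threshold=1e50):
    """Return (row, col, value) triples that are non-finite or exceed threshold.

    Assumes each data line is 'row col value'; other lines are skipped.
    """
    suspects = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a 'row col value' triple
        try:
            row, col, value = int(parts[0]), int(parts[1]), float(parts[2])
        except ValueError:
            continue  # skip anything that does not parse as numbers
        if not math.isfinite(value) or abs(value) > threshold:
            suspects.append((row, col, value))
    return suspects


# A few lines copied from the amesos-failure.dat excerpt in this issue.
sample = [
    "80 64 -3.3152784024186642e+06",
    "80 71 -1.4833076419058008e+118",
    "80 77 -3.3696905051534586e+06",
    "81 81 -1.1691518144325479e+118",
]

suspects = find_suspect_entries(sample)
for row, col, value in suspects:
    print(f"row {row}, col {col}: {value:.3e}")
```

To scan a real dump, the same function can be fed the lines of an open file. Mapping the flagged row and column indices back to degrees of freedom would then show whether they belong to the fracture tip elements.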

cssherman commented 4 years ago

> Most likely a bad matrix is being assembled (either singular or with crazy entries) and the direct solver is failing. In amesos-failure.dat, it looks like you have some matrix entries with order 1e118, which is probably not good. Can you confirm we are getting a funny matrix before the failing solve? We may then be able to identify which elements have bad local element matrices (i.e. ones with huge entries) by doing a check during assembly. I suspect these will be the fracture tip elements.

Agreed. I assume that I can get these by setting the logLevel high enough?

cssherman commented 4 years ago

> The fracture is propagating along a partition boundary
>
> Can it be related to #1000?

Odd things do tend to happen along the boundary... I'm running a case that changes the partitioning to x=8, y=1, z=4 to see if the issue occurs there as well.

cssherman commented 4 years ago

> > The fracture is propagating along a partition boundary
> >
> > Can it be related to #1000?
>
> Odd things do tend to happen along the boundary... I'm running a case that changes the partitioning to x=8, y=1, z=4 to see if the issue occurs there as well.

Running the same problem, but with the partitioning swapped, the solution gets past the first post-restart timestep. However, after a couple of additional steps, it encounters a separate error within EpetraMatrix:

Log File

Time: 200s, dt:4s, Cycle: 60
Time: 204s, dt:4s, Cycle: 61
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (1.31e+00, 9.14e-12) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (2.83e-01, 1.15e-08) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (4.95e-03, 2.45e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (1.65e-06, 6.22e-11) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
++ Fracture propagation. Re-entering Newton Solve.
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (4.71e-02, 2.03e-02) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (1.64e-01, 1.46e-08) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (2.77e-02, 1.01e-02) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (4.56e-02, 7.14e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.88e-02, 5.07e-03) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (3.83e-02, 3.53e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.38e-02, 2.53e-03) ; 

    Attempt:  0, NewtonIter:  4 ; 
( Rfluid, Rsolid ) = (2.26e-03, 2.36e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  5 ; 
( Rfluid, Rsolid ) = (4.16e-06, 2.19e-11) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
Time: 208s, dt:2s, Cycle: 62
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (2.77e-01, 2.19e-11) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (1.88e-02, 3.23e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (3.88e-05, 1.79e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (5.56e-07, 4.83e-13) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
hydrofracture: Newton solver converged in less than 20 iterations, time-step required will be doubled.
Time: 210s, dt:4s, Cycle: 63
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (3.47e-01, 4.83e-13) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (1.27e-02, 4.34e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (4.81e-05, 1.22e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (5.74e-07, 4.89e-13) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
hydrofracture: Newton solver converged in less than 20 iterations, time-step required will be doubled.
Time: 214s, dt:4s, Cycle: 64
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (3.20e-01, 4.88e-13) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (5.45e-03, 3.32e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (2.37e-06, 5.32e-11) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
++ Fracture propagation. Re-entering Newton Solve.
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (9.23e-01, 1.48e-01) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (7.86e-01, 7.83e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (9.22e-03, 1.04e-08) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (4.58e-04, 7.83e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  4 ; 
( Rfluid, Rsolid ) = (3.37e-04, 7.83e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  5 ; 
( Rfluid, Rsolid ) = (5.00e-05, 6.69e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  6 ; 
( Rfluid, Rsolid ) = (3.22e-04, 7.83e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (3.22e-04, 7.83e-03) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (3.22e-04, 7.83e-03) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (3.23e-04, 7.83e-03) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (3.23e-04, 7.83e-03) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter:  7 ; 
( Rfluid, Rsolid ) = (3.28e-04, 7.83e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (3.61e-05, 1.98e-10) ; 

    Attempt:  0, NewtonIter:  8 ; 
( Rfluid, Rsolid ) = (1.62e-04, 3.92e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.62e-04, 3.92e-03) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (1.62e-04, 3.92e-03) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (1.62e-04, 3.92e-03) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (1.63e-04, 3.92e-03) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter:  9 ; 
( Rfluid, Rsolid ) = (1.63e-04, 3.92e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (2.01e-05, 1.69e-09) ; 

    Attempt:  0, NewtonIter: 10 ; 
( Rfluid, Rsolid ) = (8.15e-05, 1.96e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (8.11e-05, 1.96e-03) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (8.13e-05, 1.96e-03) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (8.15e-05, 1.96e-03) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (8.16e-05, 1.96e-03) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter: 11 ; 
( Rfluid, Rsolid ) = (8.18e-05, 1.96e-03) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.03e-05, 1.60e-09) ; 

    Attempt:  0, NewtonIter: 12 ; 
( Rfluid, Rsolid ) = (4.10e-05, 9.79e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (4.06e-05, 9.79e-04) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (4.06e-05, 9.79e-04) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (4.08e-05, 9.79e-04) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (4.08e-05, 9.79e-04) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter: 13 ; 
( Rfluid, Rsolid ) = (4.10e-05, 9.79e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (4.96e-06, 1.16e-09) ; 

    Attempt:  0, NewtonIter: 14 ; 
( Rfluid, Rsolid ) = (2.06e-05, 4.89e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (2.04e-05, 4.89e-04) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (2.03e-05, 4.89e-04) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (2.03e-05, 4.89e-04) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (2.03e-05, 4.89e-04) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter: 15 ; 
( Rfluid, Rsolid ) = (2.07e-05, 4.89e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (2.46e-06, 7.41e-10) ; 

    Attempt:  0, NewtonIter: 16 ; 
( Rfluid, Rsolid ) = (1.00e-05, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.00e-05, 2.45e-04) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (1.03e-05, 2.45e-04) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (1.03e-05, 2.45e-04) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (1.04e-05, 2.45e-04) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter: 17 ; 
( Rfluid, Rsolid ) = (1.03e-05, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter: 18 ; 
( Rfluid, Rsolid ) = (5.81e-07, 2.04e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
Time: 218s, dt:2s, Cycle: 65
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (5.35e-01, 2.04e-10) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (2.99e-02, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (7.40e-04, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (3.09e-05, 1.67e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  4 ; 
( Rfluid, Rsolid ) = (7.74e-06, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (7.81e-06, 2.45e-04) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (1.55e-05, 2.45e-04) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (1.94e-05, 2.45e-04) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (2.13e-05, 2.45e-04) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter:  5 ; 
( Rfluid, Rsolid ) = (7.85e-06, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  6 ; 
( Rfluid, Rsolid ) = (5.87e-07, 2.34e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
hydrofracture: Newton solver converged in less than 20 iterations, time-step required will be doubled.
Time: 220s, dt:4s, Cycle: 66
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (3.59e-01, 2.34e-10) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (1.25e-02, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (1.14e-03, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (1.66e-05, 1.79e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  4 ; 
( Rfluid, Rsolid ) = (1.59e-05, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (7.59e-06, 2.45e-04) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (3.48e-06, 2.45e-04) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (1.46e-06, 2.45e-04) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (6.18e-07, 2.45e-04) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter:  5 ; 
( Rfluid, Rsolid ) = (1.61e-05, 2.45e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (7.84e-06, 6.62e-12) ; 

    Attempt:  0, NewtonIter:  6 ; 
( Rfluid, Rsolid ) = (7.93e-06, 1.22e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (4.02e-06, 1.22e-04) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (2.08e-06, 1.22e-04) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (1.14e-06, 1.22e-04) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (6.99e-07, 1.22e-04) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter:  7 ; 
( Rfluid, Rsolid ) = (8.14e-06, 1.22e-04) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (3.63e-06, 4.58e-11) ; 

    Attempt:  0, NewtonIter:  8 ; 
( Rfluid, Rsolid ) = (3.95e-06, 6.12e-05) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (2.16e-06, 6.12e-05) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (1.29e-06, 6.12e-05) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (8.84e-07, 6.12e-05) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (7.02e-07, 6.12e-05) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter:  9 ; 
( Rfluid, Rsolid ) = (4.13e-06, 6.12e-05) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.50e-06, 4.35e-11) ; 

    Attempt:  0, NewtonIter: 10 ; 
( Rfluid, Rsolid ) = (2.03e-06, 3.06e-05) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (1.32e-06, 3.06e-05) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (9.94e-07, 3.06e-05) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (8.00e-07, 3.06e-05) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (7.34e-07, 3.06e-05) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter: 11 ; 
( Rfluid, Rsolid ) = (1.95e-06, 3.06e-05) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (8.70e-07, 3.15e-11) ; 

    Attempt:  0, NewtonIter: 12 ; 
( Rfluid, Rsolid ) = (1.03e-06, 1.53e-05) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (6.66e-07, 1.53e-05) ; 
        Line search @ 0.250:      ( Rfluid, Rsolid ) = (5.31e-07, 1.53e-05) ; 
        Line search @ 0.125:      ( Rfluid, Rsolid ) = (4.95e-07, 1.53e-05) ; 
        Line search @ 0.062:      ( Rfluid, Rsolid ) = (4.89e-07, 1.53e-05) ; 
        Line search failed to produce reduced residual. Accepting iteration.

    Attempt:  0, NewtonIter: 13 ; 
( Rfluid, Rsolid ) = (1.09e-06, 1.53e-05) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
        Line search @ 0.500:      ( Rfluid, Rsolid ) = (6.50e-07, 2.02e-11) ; 

    Attempt:  0, NewtonIter: 14 ; 
( Rfluid, Rsolid ) = (5.29e-07, 7.65e-06) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
++ Fracture propagation. Re-entering Newton Solve.
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (8.34e-01, 6.51e-02) ; 

    Attempt:  0, NewtonIter:  1 ; 
( Rfluid, Rsolid ) = (3.00e-01, 7.65e-06) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  2 ; 
( Rfluid, Rsolid ) = (6.81e-03, 3.64e-09) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  3 ; 
( Rfluid, Rsolid ) = (1.51e-04, 7.65e-06) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  4 ; 
( Rfluid, Rsolid ) = (6.46e-06, 7.65e-06) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 

    Attempt:  0, NewtonIter:  5 ; 
( Rfluid, Rsolid ) = (5.63e-07, 2.75e-10) ; Last LinSolve(iter,res) = (  0, 0.00e+00) ; 
++ Fracture propagation. Re-entering Newton Solve.
    Attempt:  0, NewtonIter:  0 ; 
***** ERROR
***** LOCATION: /usr/WS2/sherman/GEOSX/src/coreComponents/linearAlgebra/interfaces/trilinos/EpetraMatrix.cpp:542
***** Controlling expression (should be false): globalRow >= iupper()
***** Rank 1: 
Expected globalRow < iupper()
  globalRow = 44646
  iupper() = 15681

** StackTrace of 11 frames **
Frame 1: geosx::EpetraMatrix::clearRow(long long, bool, double)
Frame 2: 
Frame 3: geosx::HydrofractureSolver::ApplyBoundaryConditions(double, double, geosx::DomainPartition*, geosx::DofManager const&, geosx::EpetraMatrix&, geosx::EpetraVector&)
Frame 4: geosx::SolverBase::NonlinearImplicitStep(double const&, double const&, int, geosx::DomainPartition*, geosx::DofManager const&, geosx::EpetraMatrix&, geosx::EpetraVector&, geosx::EpetraVector&)
Frame 5: geosx::HydrofractureSolver::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 6: geosx::SolverBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 7: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 8: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 9: main
Frame 10: __libc_start_main
Frame 11: /g/g17/sherman/GEOS/geosx/GEOSX/build-quartz-clang@9.0.0-release/bin/geosx() [0x402a10]

joshua-white commented 4 years ago

iupper() is one past the largest row index on the given rank (for rank 1, iupper=15681). It looks like we are trying to apply a boundary condition to row 44646, which doesn't exist on rank 1. It definitely looks like some indexing is goofed up.
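The ownership test behind that assertion can be sketched as a half-open range check. This is an illustration only, not the GEOSX implementation: `owns_row` and `rows_outside_range` are hypothetical helpers, and the `ilower` value of 7800 is made up, since the log reports only `iupper()` for rank 1.

```python
def owns_row(global_row, ilower, iupper):
    """True if global_row lies in the half-open owned range [ilower, iupper)."""
    return ilower <= global_row < iupper


def rows_outside_range(rows, ilower, iupper):
    """Return the global rows that the rank does not own."""
    return [row for row in rows if not owns_row(row, ilower, iupper)]


# iupper and the offending row come from the error message in this issue;
# ilower and the other candidate rows are hypothetical.
ilower, iupper = 7800, 15681
candidate_rows = [8000, 15680, 15681, 44646]

for row in rows_outside_range(candidate_rows, ilower, iupper):
    print(f"row {row} is not owned by this rank (range [{ilower}, {iupper}))")
```

Running a check like this over the boundary-condition target rows before calling `clearRow` would list every out-of-range index at once, rather than stopping at the first failure.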

cssherman commented 4 years ago

Also, this time there happened to be a plot file written on the timestep where the error occurred. This shows the fracture aperture, with the domain partition numbers highlighted in the background:

image

rrsettgast commented 4 years ago

@cssherman A couple of observations from your last run:

  1. The error is on the application of a kinematic boundary condition, and the error is occurring after a fracture, so the only place that these things are occurring is on the xneg boundary. Do you agree with this?
  2. The command line output rank 1 is actually displayed +1 in VisIt, so that is rank 2, which is in the green.
  3. If 1) and 2) are correct, I don't think this could be on a partition boundary?

Can you locate the offending node number associated with that row?

cssherman commented 4 years ago

> @cssherman A couple of observations from your last run:
>
> 1. The error is on the application of a kinematic boundary condition, and the error is occurring after a fracture, so the only place that these things are occurring is on the `xneg` boundary. Do you agree with this?
> 2. The command line output `rank 1` is actually displayed +1 in VisIt, so that is `rank 2`, which is in the green.
> 3. If 1) and 2) are correct, I don't think this could be on a partition boundary?
>
> Can you locate the offending node number associated with that row?

That makes sense. The only BCs that are applied near the partition are the xneg-roller and the fluid source (the fracture growth is biased upwards in this model, so it isn't symmetric). I can try finding the node id with TotalView. By the way, I've set the permissions to access this model in case you'd like to look at it:

/p/lscratchh/sherman/HFTS/GEOS_vs_GEOSX_benchmarks/single_fracture_clean_fluid/GEOSX_new/alternate

cssherman commented 4 years ago

@rrsettgast @joshua-white @af1990 -

I've pared down the example problem, modified the xml to output snapshots of the solution matrices, and reran the problem. As before, for the serial case, the initial and restart runs behave well. For the parallel case (ny = 2, fracture propagating along the boundary), the solution fails during restart runs following the first fracturing event (in this example, we restart at 50 s, and fracture at 90 s). Also, regarding your previous suggestion: the error still occurs whether or not we apply a BC on xneg.

The error message this time around looks like this:

++ Fracture propagation. Re-entering Newton Solve.
    Attempt:  0, NewtonIter:  0 ; 
( Rfluid, Rsolid ) = (3.98e+00, 7.02e+00) ; 

        *******************************************************
        ***** Problem: Thyra::DefaultBlockedLinearOp<double>{numRowBlocks=2,numColBlocks=2}
        ***** Preconditioned GMRES solution
        ***** User-defined preconditioner
        ***** No scaling
        *******************************************************

                  iter:    0           residual = 1.000000e+00
                  iter:    1           residual = 9.999615e-01

    ***************************************************************

    Warning: the GMRES Hessenberg matrix is ill-conditioned.  This may 
    indicate that the application matrix is singular. In this case, GMRES
    may have a least-squares solution.

    Solver:         gmres
    number of iterations:   2

    Actual residual =  1.4672e+08   Recursive residual =  1.4672e+08

    Calculated Norms                Requested Norm
    --------------------------------------------    --------------

    ||r||_2 / ||r0||_2:     9.999615e-01    1.000000e-06

    ***************************************************************

Before the fracturing event, the silo files written for the initial/restart runs look visually identical. When looking at the solver matrices written by the code, I noticed that there was some odd behavior associated with the size of these files. (Note: the sizes of the files for the serial cases are all consistent with the initial parallel runs)

Before the fracturing event, the file headers are as follows:

| File | Initial | Restart |
| --- | --- | --- |
| matrix00_80 | 393 393 20673 | 393 393 20673 |
| matrix01_80 | 393 6 111 | 393 6 111 |
| matrix10_80 | 6 393 225 | 6 393 225 |
| matrix11_80 | 6 6 18 | 6 6 18 |
| residual0_80 | 393 1 | 393 1 |
| residual1_80 | 6 1 | 6 1 |

And after the fracturing event, the headers are as follows:

| File | Initial | Restart |
| --- | --- | --- |
| matrix00_90 | 399 399 21141 | 399 399 21087 |
| matrix01_90 | 399 8 159 | 399 8 183 |
| matrix10_90 | 8 399 351 | 8 399 381 |
| matrix11_90 | 8 8 28 | 8 8 28 |
| residual0_90 | 399 1 | 399 1 |
| residual1_90 | 8 1 | 8 1 |

Interestingly, the total number of entries across matrix00, matrix01, and matrix10 is the same between the initial and restart runs. However, there are fewer entries in matrix00 and more entries in matrix01 and matrix10 in the restart runs...
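The bookkeeping behind that observation can be checked with a few lines of arithmetic. This sketch assumes the third number in each file header is the entry count, and uses the post-fracture (`_90`) headers listed above; the dictionary and variable names are illustrative.

```python
# Entry counts taken from the third header field of the *_90 matrix files.
initial = {"matrix00": 21141, "matrix01": 159, "matrix10": 351, "matrix11": 28}
restart = {"matrix00": 21087, "matrix01": 183, "matrix10": 381, "matrix11": 28}

# Blocks whose combined entry count is compared in the comment above.
blocks = ("matrix00", "matrix01", "matrix10")

total_initial = sum(initial[name] for name in blocks)
total_restart = sum(restart[name] for name in blocks)

# Per-block change in entry count going from the initial run to the restart run.
diff = {name: restart[name] - initial[name] for name in initial}

print(f"initial total: {total_initial}, restart total: {total_restart}")
for name, delta in diff.items():
    print(f"{name}: {delta:+d}")
```

Both totals come to 21651: the restart run has 54 fewer entries in matrix00, and 24 and 30 more in matrix01 and matrix10 respectively, while matrix11 is unchanged.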

Any thoughts? Note: these were run using #990, which patches a small bug that can cause already-executed SoloEvents to modify the first timestep during the first cycle after restart.

I've attached a zip file with the runs here: tiny_hf_example.zip

TotoGaz commented 1 year ago

Hello @cssherman, do you have any new information on this?

paveltomin commented 1 year ago

@cssherman any update?