erf-model / ERF

Energy Research and Forecasting Model
https://erf.readthedocs.io/en/latest/

MOST test with terrain failing #1453

Closed indra098124 closed 8 months ago

indra098124 commented 8 months ago

Hi there, I tried the MOST tests provided in terrain3d_Hemisphere and WitchOfAgnesi, and both are failing for me. Are they expected to run from the initial condition defined in prob, or should we run without MOST first? I am using the latest version of the code and get a "SIGILL Invalid, privileged, or ill-formed instruction" error with these tests.

Many thanks for developing the code and answering my question.

asalmgren commented 8 months ago

Hi @indra098124 -- could you try with the inputs files in those directories and see if that works for you? Here https://ccse.lbl.gov/pub/RegressionTesting1/ERF/ is our nightly regression test suite -- all of these should "just work" if you try them-- maybe also try some of these as well so we can rule out issues, then we can see about this particular problem.

AMLattanzi commented 8 months ago

Are you running these tests locally on a mac?

indra098124 commented 8 months ago

Thank you @asalmgren and @AMLattanzi for looking into this. @AMLattanzi yes, I am running these locally on Mac. @asalmgren I can run ABL cases (that are also included in nightly tests) with no problem.

baperry2 commented 8 months ago

Set amrex.fpe_trap_invalid = 0 in the input files, which turns off some runtime error checking. The Apple Clang compilers sometimes perform optimizations that cause the AMReX checks for divide by zero and similar errors to fail spuriously (conditional branches that are never taken and involve a divide by zero may still be evaluated). These optimizations aren't performed in debug mode, so if needed you can also run with amrex.fpe_trap_invalid = 1 if you compile with DEBUG = TRUE.
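A minimal C++ sketch of the mechanism described above, under stated assumptions: the names guarded_ratio and invalid_flag_raised are hypothetical, and the division in the unused branch is written out explicitly (rather than produced by an optimizer) so that the effect is reproducible. It only inspects the IEEE "invalid" flag through the standard <cfenv> calls; with trapping enabled, the same event would be delivered as a signal instead.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: returns num/den, or 0 when den == 0.
// The division is performed unconditionally, mimicking what an optimizer
// may do when it turns a branch into a select.
inline double guarded_ratio(double num, double den)
{
    volatile double n = num;    // volatile keeps the division at run time
    volatile double d = den;
    volatile double q = n / d;  // 0.0/0.0 raises the IEEE "invalid" flag here
    return (den != 0.0) ? q : 0.0;  // q is never used when den == 0
}

// Hypothetical helper: clears the floating-point flags, evaluates the
// guarded 0/0 case, and reports whether "invalid" was raised anyway.
inline bool invalid_flag_raised()
{
    std::feclearexcept(FE_ALL_EXCEPT);
    const double r = guarded_ratio(0.0, 0.0);
    assert(r == 0.0);  // the returned value itself is well defined
    return std::fetestexcept(FE_INVALID) != 0;
}
```

The returned value is correct in every case, yet the flag is set, which is why a trap on invalid operations can abort a run whose results would otherwise be fine.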

asalmgren commented 8 months ago

@baperry2 -- that's really good to know -- could you add that to the docs somewhere?!

indra098124 commented 8 months ago

Thanks @baperry2, I was not aware of this. I tried that but it did not help. I also tried to run this test on a Linux machine and I get an "erroneous arithmetic operation" error. Looking at the backtrace, it appears that the error originates in the MOST calculation at "Source/BoundaryConditions/MOSTAverage.H:143:56".

Here is the code snippet where it fails:

    for (int n = 0; n < interp_comp; n++)
        interp_vals[n] = sx_lo[0]*sx_lo[1]*sx_lo[2]*interp_array(i-1, j-1, k-1, n) +
                         sx_lo[0]*sx_lo[1]*sx_hi[2]*interp_array(i-1, j-1, k  , n) +
                         sx_lo[0]*sx_hi[1]*sx_lo[2]*interp_array(i-1, j  , k-1, n) +
                         sx_lo[0]*sx_hi[1]*sx_hi[2]*interp_array(i-1, j  , k  , n) +
                         sx_hi[0]*sx_lo[1]*sx_lo[2]*interp_array(i  , j-1, k-1, n) +
                         sx_hi[0]*sx_lo[1]*sx_hi[2]*interp_array(i  , j-1, k  , n) +
                         sx_hi[0]*sx_hi[1]*sx_lo[2]*interp_array(i  , j  , k-1, n) +
                         sx_hi[0]*sx_hi[1]*sx_hi[2]*interp_array(i  , j  , k  , n);
    }
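For context, the expression in the snippet has the form of a trilinear interpolation: each of the eight surrounding cell values is weighted by the product of three one-dimensional weights. The sketch below is an illustration only, not the ERF source. Field3D, trilinear, and the weight convention (w_hi[d] is the weight of the upper cell in direction d, and w_lo[d] = 1 - w_hi[d]) are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 3D field with (i,j,k) indexing, i fastest, no ghost cells.
struct Field3D {
    int nx, ny, nz;
    std::vector<double> data;

    Field3D(int nx_, int ny_, int nz_)
        : nx(nx_), ny(ny_), nz(nz_),
          data(static_cast<std::size_t>(nx_) * ny_ * nz_, 0.0) {}

    double& operator()(int i, int j, int k)       { return data[idx(i, j, k)]; }
    double  operator()(int i, int j, int k) const { return data[idx(i, j, k)]; }

private:
    std::size_t idx(int i, int j, int k) const {
        // Reading outside the allocated range is a bug, so check it.
        assert(i >= 0 && i < nx && j >= 0 && j < ny && k >= 0 && k < nz);
        return static_cast<std::size_t>(i)
             + static_cast<std::size_t>(nx) * (j + static_cast<std::size_t>(ny) * k);
    }
};

// Interpolate among the eight cells (i-1..i, j-1..j, k-1..k).
// w_hi[d] in [0,1] is the weight of the upper cell in direction d.
inline double trilinear(const Field3D& f, int i, int j, int k,
                        const std::array<double, 3>& w_hi)
{
    const std::array<double, 3> w_lo = {1.0 - w_hi[0], 1.0 - w_hi[1], 1.0 - w_hi[2]};
    return w_lo[0]*w_lo[1]*w_lo[2]*f(i-1, j-1, k-1)
         + w_lo[0]*w_lo[1]*w_hi[2]*f(i-1, j-1, k  )
         + w_lo[0]*w_hi[1]*w_lo[2]*f(i-1, j  , k-1)
         + w_lo[0]*w_hi[1]*w_hi[2]*f(i-1, j  , k  )
         + w_hi[0]*w_lo[1]*w_lo[2]*f(i  , j-1, k-1)
         + w_hi[0]*w_lo[1]*w_hi[2]*f(i  , j-1, k  )
         + w_hi[0]*w_hi[1]*w_lo[2]*f(i  , j  , k-1)
         + w_hi[0]*w_hi[1]*w_hi[2]*f(i  , j  , k  );
}
```

Because the stencil reads indices i-1, j-1, and k-1, a point in the first cell of any direction needs valid data one cell outside the interior. In this sketch the assert catches that case.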

baperry2 commented 8 months ago

@asalmgren will do. Even though there appears to be more going on here, I definitely learned about the spurious FPEs on Macs the hard way, and it would be good to have that information more widely available.

@indra098124 - I tried again and see the same thing as you. For Witch of Agnesi, I see a spurious FPE that resolves with amrex.fpe_trap_invalid = 0 when running with inputs, but the same error as you when running with inputs_most_test, which appears to be a real error.

AMLattanzi commented 8 months ago

@indra098124 Thank you for sharing the issue with inputs_most_test. The problem was that the Theta_prim variable did not have its ghost cells filled yet, and the interpolation routine (where your backtrace points) had to access that data. PR 1455 ran successfully in debug mode on my local machine with single and multiple cores. Please let me know if you have further issues.

indra098124 commented 8 months ago

Thank you @AMLattanzi. I modified my copy to have IntVect ng = Theta_prim[lev]->nGrowVect(); in ERF.cpp and in ERF_Advance.cpp, but it is still failing for me. I will try the version from the PR.

AMLattanzi commented 8 months ago

@indra098124 Yes it should fail still with that revision. The creation of the MOST class and the calls to the MOST averaging needed to be moved later after the ghost cells were populated by FillPatch. If you see the issue arise, or a new issue, with the current development (e9bcaa0) let me know.
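To illustrate why the ordering matters, here is a minimal one-dimensional sketch, assuming a single array with ghost cells that start out uninitialized (NaN). Ghosted1D, fill_ghosts_periodic, and average_with_lower_neighbor are hypothetical names; the periodic fill merely stands in for a ghost-cell fill such as FillPatch and is not the AMReX API.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical 1D array with ng ghost cells on each side.
// Interior indices are 0..n-1; ghost indices are -ng..-1 and n..n+ng-1.
struct Ghosted1D {
    int n, ng;
    std::vector<double> data;

    Ghosted1D(int n_, int ng_)
        : n(n_), ng(ng_),
          data(n_ + 2 * ng_, std::numeric_limits<double>::quiet_NaN()) {}

    double& operator()(int i) {
        assert(i >= -ng && i < n + ng);
        return data[i + ng];
    }
    double operator()(int i) const {
        assert(i >= -ng && i < n + ng);
        return data[i + ng];
    }
};

// Stand-in for a ghost-cell fill: copy interior values periodically.
inline void fill_ghosts_periodic(Ghosted1D& a)
{
    for (int g = 1; g <= a.ng; ++g) {
        a(-g)          = a(a.n - g);  // low-side ghosts from the high end
        a(a.n - 1 + g) = a(g - 1);    // high-side ghosts from the low end
    }
}

// A stencil that reads one cell below i, as the interpolation above does.
inline double average_with_lower_neighbor(const Ghosted1D& a, int i)
{
    return 0.5 * (a(i - 1) + a(i));
}
```

Calling average_with_lower_neighbor at i = 0 before fill_ghosts_periodic reads the uninitialized ghost cell and returns NaN; the same call after the fill returns a finite value.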

indra098124 commented 8 months ago

@AMLattanzi unfortunately, it is still failing for me with the latest version. I tried the debug version as well, and with it I get the following error (on both Mac and Linux):

    amrex::Abort::1:: (127,-1,-1,0) is out of bound (125:258,-3:10,0:63,0:0) !!!
    SIGABRT
    amrex::Abort::0:: (117,1,-1,0) is out of bound (-3:130,-3:10,0:63,0:0) !!!
    SIGABRT
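One way to read these abort messages is to compare each index with its range. The sketch below assumes the four entries are ordered (i, j, k, component) and that each range is an inclusive lo:hi pair; Range and first_out_of_bound are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>

// Inclusive index range lo:hi, as printed in the abort message.
struct Range { int lo, hi; };

// Returns the position (0 = i, 1 = j, 2 = k, 3 = component) of the first
// index that lies outside its range, or -1 if every index is in bounds.
inline int first_out_of_bound(const std::array<int, 4>& idx,
                              const std::array<Range, 4>& box)
{
    for (int d = 0; d < 4; ++d) {
        if (idx[d] < box[d].lo || idx[d] > box[d].hi) {
            return d;
        }
    }
    return -1;
}
```

Under that assumption, both messages fail on the third entry: k = -1 lies below the range 0:63, while the i and j indices are within their ranges (which include ghost cells). In other words, the access is one cell below the bottom of the array in the vertical direction.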

I tried running realclean and also a fresh download.

indra098124 commented 8 months ago

@AMLattanzi and @asalmgren there are other cases as well that are failing for me. I am not sure if I am doing something wrong.

  1. ABL/inputs.write -> The inputs file needed prob.T_0 = 300.0; after that it worked.
  2. ABL/inputs.read -> This has been giving a segfault. The backtrace points to if (input_bndry_planes && m_r2d->ingested_velocity()) in ERF_init_bcs.cpp:86. Debug and assertions don't reveal anything more. I did generate boundary files using inputs.write before trying this.
  3. ABL_input_sounding does not compile. I only needed input_sounding, which is what led me to this compilation issue. The errors are related to USE_POISSON_SOLVE = TRUE. The first is /TI_headers.H:270:30: error: 'Vector' does not name a type 270 | const Vector<amrex::Real> d_rayleigh_ptrs_at_lev); I think it should be amrex::Vector. There was another error about use_rayleigh_damping not being declared, which might be a typo, since elsewhere it is referenced as solverChoice.use_rayleigh_damping. At TI_no_substep_fun.H:133:13 the compiler complains that incompressible is not declared. Lastly, at TI_slow_rhs_fun.H:357:25 I get an error: cannot convert 'std::unique_ptr<amrex::MultiFab>' to 'const amrex::MultiFab' erf_slow_rhs_inc(level, nrk, slow_dt. I could use input_sounding once I disabled poisson_solve.

Thank you!

asalmgren commented 8 months ago

I believe we didn’t mean to build with USE_POISSON_SOLVE on. If you set that to false does it build ok?

Thank you for all the great feedback! We need to do a better job of making sure the inputs files in the repo work correctly.

indra098124 commented 8 months ago

Thank you @asalmgren and thank you ERF development team for making the software available open source. Yes, after disabling the poisson solver, I can build and run this.

The last thing I am figuring out is how to use boundary input.

AMLattanzi commented 8 months ago

@indra098124 sounds like things are alright on this front? Are we good to close this particular issue?

AMLattanzi commented 8 months ago

I believe the inputs.write and inputs.read should work once PR 1461 goes through.

indra098124 commented 8 months ago

@AMLattanzi thanks for following up. I am not sure, because the MOST test with terrain still fails for me with the following error:

    amrex::Abort::1:: (127,-1,-1,0) is out of bound (125:258,-3:10,0:63,0:0) !!!
    SIGABRT
    amrex::Abort::0:: (117,1,-1,0) is out of bound (-3:130,-3:10,0:63,0:0) !!!
    SIGABRT

May I confirm whether you were able to run terrain3d_Hemisphere successfully?

AMLattanzi commented 8 months ago

Ah, I have not tested hemisphere with MOST! Let me give that a go and I can either follow up with the results or create a PR to alleviate the issue. Thanks for clarifying.

AMLattanzi commented 8 months ago

@indra098124 I believe I have corrected the issue with MOST and the 3d hemisphere in PR 1465. Thank you again for bringing these issues to our attention, we greatly appreciate the feedback.

indra098124 commented 8 months ago

Thank you @AMLattanzi for your help.

indra098124 commented 8 months ago

@AMLattanzi after the new fix, the inputs_most_test in ABL seems to be broken. I find that with erf.most.average_policy = 0, the code diverges at the first time step with the error "0::Assertion `cell_data(i,j,k,RhoTheta_comp) > 0.' failed, file "../../Source/TimeIntegration/ERF_slow_rhs_pre.cpp", line 566". With erf.most.average_policy = 1 it works fine. Would you mind having a look?

Many thanks

indra098124 commented 8 months ago

Additionally, it looks like there is an issue with MOST when a surface temperature is used. It always gives "SIGILL Invalid, privileged, or ill-formed instruction". For example, see the GABLS1 case.

AMLattanzi commented 8 months ago

@indra098124 The issue with the hemisphere should be corrected in PR 1468. The salient problem was that the turbulent viscosity was 0 for the given initialization; this is inconsistent with the MOST BC and the limiting we did with 1e-16 was not sufficient for stability. I also added an option for small perturbations in the IC to give finite strain and thus non-zero turbulent viscosity with Smagorinsky (the fluctuations seem to dissipate quickly). This ran for planar and local average for 10 steps.
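As a rough illustration of why a uniform initial velocity gives zero turbulent viscosity, the sketch below uses the textbook Smagorinsky form, nu_t = (Cs * delta)^2 * |S|, with |S| = sqrt(2 S_ij S_ij) and S_ij = 0.5 * (du_i/dx_j + du_j/dx_i). This is a generic sketch and not the ERF implementation; Grad, strain_magnitude, smagorinsky_nu_t, and the constants used in the example are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Velocity gradient tensor: grad[i][j] = d u_i / d x_j.
using Grad = std::array<std::array<double, 3>, 3>;

// |S| = sqrt(2 S_ij S_ij), with S_ij the symmetric part of the gradient.
inline double strain_magnitude(const Grad& grad)
{
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            const double s_ij = 0.5 * (grad[i][j] + grad[j][i]);
            sum += s_ij * s_ij;
        }
    }
    return std::sqrt(2.0 * sum);
}

// Textbook Smagorinsky eddy viscosity: nu_t = (Cs * delta)^2 * |S|.
// Cs is the model constant and delta the filter width (both assumed inputs).
inline double smagorinsky_nu_t(const Grad& grad, double Cs, double delta)
{
    assert(Cs >= 0.0 && delta > 0.0);
    const double length = Cs * delta;
    return length * length * strain_magnitude(grad);
}
```

With a uniform velocity field every gradient entry is zero, so the strain and therefore nu_t vanish; any small perturbation that produces a non-zero gradient, for instance a vertical shear du/dz, yields a finite viscosity.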

With respect to the GABLS case, I am unable to replicate that issue. The instruction error you mention sounds like the mac issue Bruce explained. I have yet to see that error on a Linux machine with ERF. Perhaps try in DEBUG mode.

indra098124 commented 8 months ago

Thanks @AMLattanzi . This PR seems to have fixed the other issues (GABLS and ABLMost). I can see the ABLMost regression test ran successfully (https://ccse.lbl.gov/pub/RegressionTesting1/ERF/) while it was failing earlier today. Also thank you for explaining what was wrong.

Many thanks

asalmgren commented 8 months ago

@indra098124 -- are we good to close this issue?

indra098124 commented 8 months ago

Thank you @AMLattanzi. Yes @asalmgren we can close this.