esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

AWI-CM1 in Levante suddenly stops #807

Closed antoniofis10000 closed 1 month ago

antoniofis10000 commented 2 years ago

Dear Developers. I do not know why my AWI-CM1 runs in Levante suddenly stop without any apparent reason. This happens in two contexts.

The first is when I try to run a two-month extension (same issue in Mistral); see for example: /work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/OctoberNovember2020NudgingOldE4. The model seems to stop suddenly after one month, and if I do not kill the process it keeps running without making any progress. Therefore I need to do monthly restarts (OctoberNovember2020NudgingOldE41Month), which is suboptimal.

The second one is when I try to run the model for a few days. It runs fine for a single day (/work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/1Dayv3), but when I try to run it for 4 (/work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/4Daysv1), 5 or 16 days (/work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM16Daysv1), it stops on day 2.

Thanks in advance. Best wishes, Antonio.

pgierz commented 2 years ago

Moin Antonio

In both cases that sounds like a funky, badly configured namelist. Since you are using AWICM1, that will be ECHAM. Can you post the dt_start, dt_stop, lresume, putrerun and putdata settings?

I'll also have some time to look in detail, but not until Thursday I'm afraid.

Best PG

antoniofis10000 commented 2 years ago

In both cases these parameters seem correct (no reason to stop the simulation, I think). In the first case (run stopped after one month instead of running for two months), according to /work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/OctoberNovember2020NudgingOldE4/run_20201001-20201130/config/OctoberNovember2020NudgingOldE4_finished_config.yaml, these parameters are:

    dt_resume:
    - 2020
    - 10
    - 1
    dt_stop:
    - 2020
    - 12
    - 1
    lresume: true
    putrerun:
    - 2
    - months
    - first
    - 0
    putdata:
    - 1
    - hours
    - first
    - 0

In the second one (the few-days run that stopped after 2 days), for example for the /work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/16Daysv1 simulation, these parameters are (extracted from the same file as above):

    dt_resume:
    - 2019
    - 1
    - 1
    dt_stop:
    - 2019
    - 1
    - 17
    lresume: true
    putdata:
    - 1
    - hours
    - first
    - 0
    putrerun:
    - 16
    - days
    - first
    - 0
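
For reference, these ECHAM time controls can also be forced from the esm_tools runscript instead of relying on the derived values, so that the rerun (restart-write) interval stays consistent with the intended run length. The snippet below is only a minimal sketch using the usual namelist_changes mechanism; the general: nday entry and the exact nesting are assumptions on my side, so please verify them against a working runscript:

    general:
        nday: 16                    # per-job run length (hypothetical key/value for a 16-day test)
    echam:
        namelist_changes:
            namelist.echam:
                runctl:
                    putrerun:       # restart-write interval; keep consistent with the run length
                    - 16
                    - days
                    - first
                    - 0
                    putdata:        # output interval
                    - 1
                    - hours
                    - first
                    - 0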

Thanks. Best wishes, Antonio.

antoniofis10000 commented 2 years ago

Hi @pgierz. Do you have any news?

Thanks. Best wishes, Antonio.

pgierz commented 2 years ago

Hi @antoniofis10000, I'm on holiday this week so I have not had time to look into it.

To help diagnose further: Does the problem also occur with 1xmonth or 1xyear configurations? Is it only happening with greater than 1?

denizural commented 2 years ago

Hi @antoniofis10000, I am back, so I can take over this problem.

antoniofis10000 commented 2 years ago

Hi @pgierz and @denizural

> Hi @antoniofis10000, I'm on holiday this week so I have not had time to look into it.
>
> To help diagnose further: Does the problem also occur with 1xmonth or 1xyear configurations? Is it only happening with greater than 1?

Not always; sometimes (for example when restarting on 1st January 2017) I am able to run 6 months (see for example: /work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/E3PresFromBegv4).

@denizural Thanks! Please let me know if you have any news.

Best wishes, Antonio.

antoniofis10000 commented 2 years ago

Dear Developers. Do you have any news about this issue? We need to have this issue solved by next Monday.

Best wishes, Antonio.

pgierz commented 2 years ago

Hi Antonio,

Do you have any info for the runs that crash after 6 months? Is it a numerical problem, or are the tools actually acting up? Seeing a copy/paste of a log dump would be useful…

I’ll also ping @mandresm (Sorry Miguel). My own time for support is limited this week as I’m helping with very strange problems on AWI’s new supercomputer so unfortunately I do not know if I will have time to look at this until Friday, which I guess is a little bit tight for your schedule.

Best Paul

antoniofis10000 commented 2 years ago

Dear Developers. I think I do not have any simulations crashing after 6 months.

I am not sure the restarts are properly configured in our run. As far as I can tell, ECHAM finishes but is still waiting for something (FESOM?). I tried a new 16-day run setting final_date to 2017-01-17 as well as the new oasis3mct restart=TRUE (see /work/ba1264/a270148/esm-experiments/AWICM/Test16Days13Sept), and now it runs until 2017-01-12 (instead of 2017-01-02). But then ECHAM seems to finish (/work/ba1264/a270148/esm-experiments/AWICM/Test16Days13Sept/run_20170101-20170116/work/echam.stderr) while /work/ba1264/a270148/esm-experiments/AWICM/Test16Days13Sept/log/Test16Days13Sept_awicm_observe_compute_20170101-20170116.log still says it is running, and yes, if I do not interrupt it, it will reach the 8-hour wall time limit. Consequently it does not end properly (and, for example, the files are not moved to their directories).

The same simulation works properly if the restart rate is set to 6 months (/work/ba1264/a270148/esm-experiments/AWICM/E2FreeRun20172017_2448_6MR).

Thanks in advance. Best wishes, Antonio.

denizural commented 2 years ago

Hi Antonio, I am working on this issue. It seems like you have many uncommitted changes in your ESM-Tools. I am using our latest release to fix this issue.

mandresm commented 2 years ago

Hi Antonio, how long were your runs in Mistral? And how long are your runs now?

mandresm commented 2 years ago

@denizural, I'll be having a look at this issue as well. I've been working with Antonio quite heavily recently, and he needs results for Monday, so it's okay to invest two people on this to make sure he gets what he needs.

antoniofis10000 commented 2 years ago

Thanks so much. In Mistral we started to run the simulations with a restart every 3 months. But when we reached October 2020 we needed to extend the simulations by only 2 months, and then this issue appeared for the first time ("solved" by using a restart every month). I think I deleted all these simulations from Mistral, but I had the same issue in Levante (/work/ab0995/from_Mistral/ab0995/a270148/esm-experiments/AWICM/OctoberNovember2020NudgingOldE4).

Just as a reminder, the 16-day simulations use as a restart point the CMIP6 simulations from Tido, which are restarted every 12 months.

As probably expected, I am able to run a 6-day (/work/ba1264/a270148/esm-experiments/AWICM/Test6Days13Sept) or a 12-day simulation (/work/ba1264/a270148/esm-experiments/AWICM/Test12Days14Sept).

antoniofis10000 commented 2 years ago

Dear Developers. Another point: I tried to run a 20-day simulation (/work/ba1264/a270148/esm-experiments/AWICM/Test20Days14Sept/) and it stops after day 4. Now I understand even less.

antoniofis10000 commented 2 years ago

Neither changing Trigfiles (to 16: /work/ba1264/a270148/esm-experiments/AWICM/Test16Days13SeptTrigFiles, or to 17: /work/ba1264/a270148/esm-experiments/AWICM/Test16Days13SeptTrigFiles17Days) nor changing the restart rate (/work/ba1264/a270148/esm-experiments/AWICM/Test16Days13SeptRestartRate1) works.

mandresm commented 2 years ago

> Dear Developers. Another point: I tried to run a 20-day simulation (/work/ba1264/a270148/esm-experiments/AWICM/Test20Days14Sept/) and it stops after day 4. Now I understand even less.

When you say "it stops", do you mean it holds like the previous one?

antoniofis10000 commented 2 years ago

Yes, sorry; but instead of on day 12, in this case it holds on day 4.

mandresm commented 2 years ago

Can you provide the path to your esm_tools installation?

antoniofis10000 commented 2 years ago

/work/ab0995/a270148/ESMTOOLS10August2022

mandresm commented 2 years ago

Do you get any missing files when you submit the 16-day simulation?

antoniofis10000 commented 2 years ago

Yes, some related to OASIS and one related to JSBACH, but they are the same as in the simulations used for the plots (time series and maps) shown today.

mandresm commented 2 years ago

Me too. The OASIS ones are very worrying... can you copy the missing OASIS files into the OASIS restart folder?

mandresm commented 2 years ago

What I am suspecting right now: the remapping files are missing, and in my simulation (copied from yours) it takes a very long time to build them. It might be that we are hitting the wall time simply because the remapping files are missing and OASIS has to build them, which takes time.

antoniofis10000 commented 2 years ago

I have copied these OASIS restart files from the 1-year simulation in Levante (/work/ba1264/a270148/esm-experiments/AWICME2FreeRun20172017_2448_6MRRestartOasisTrue) and I have run the 16-day simulation with trigfiles set to 16 days. No change can be observed (/work/ba1264/a270148/esm-experiments/AWICM/Test16Days13SeptTrigFilesOasis3). It is true that the computational time is now reduced, but I think it is not significant (it was around 30 minutes).

mandresm commented 2 years ago

Okay, so that was only one minor problem... I am currently running a simulation with the latest fix/echam_default_namelists_ssp without your local changes, just in case. I'll report when I have some results.

antoniofis10000 commented 2 years ago

FYI, no change with nproca: 24 and nprocb: 48 (same as in Mistral): /work/ba1264/a270148/esm-experiments/AWICM/Test16Days13SeptOasis32448

mandresm commented 2 years ago

Antonio, were you ever able to run 16-day runs on Mistral with the old tools? If so, do you have an experiment I can look at?

Something new: if you set all the ECHAM time variables in namelist.echam to have last instead of first (the ESM-Tools default for most of these variables), including putrerun, putdata..., you get the following error from FESOM after day 12. Logfile: /work/ab0995/a270152/16DaysAntonio2/log/16DaysAntonio2_awicm_compute_20170101-20170116_1995964.lo

 768: step:   2200, day:  12, year: 2017, duration: 0.22246s, total wallclock: +70.36713s = 00:30:29
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768:  Fluxes have been modified.
 768: step:   2300, day:  12, year: 2017, duration: 0.19193s, total wallclock: +77.50272s = 00:31:46
 514: [l30020:1370733:0:1370733] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 514: ==== backtrace (tid:1370733) ====
 514:  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 514:  1 0x00000000012db578 mo_transpose_mp_gather_gp2_()  ???:0
 514:  2 0x0000000000d22d16 mo_couple_mp_smooth_flux_()  ???:0
 514:  3 0x0000000000d2265f mo_couple_mp_couple_put_a2o_()  ???:0
 514:  4 0x0000000000d1e59a mo_couple_mp_couple_end_()  ???:0
 514:  5 0x0000000000b77915 MAIN__()  ???:0
 514:  6 0x0000000000415fa2 main()  ???:0
 514:  7 0x0000000000023493 __libc_start_main()  ???:0
 514:  8 0x0000000000415eae _start()  ???:0
 514: =================================
 526: [l30020:1370745:0:1370745] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 520: [l30020:1370739:0:1370739] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 521: [l30020:1370740:0:1370740] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 522: [l30020:1370741:0:1370741] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 523: [l30020:1370742:0:1370742] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 524: [l30020:1370743:0:1370743] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 562: [l30020:1370781:0:1370781] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 563: [l30020:1370782:0:1370782] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 527: [l30020:1370746:0:1370746] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 519: [l30020:1370738:0:1370738] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 561: [l30020:1370780:0:1370780] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 564: [l30020:1370783:0:1370783] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)

...

  85: [l10442:866255:0:866255] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
  89: [l10442:866259:0:866259] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 112: [l10442:866282:0:866282] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)
 564: ==== backtrace (tid:1370783) ====
 564:  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 564:  1 0x00000000012db578 mo_transpose_mp_gather_gp2_()  ???:0
 564:  2 0x0000000000d22d16 mo_couple_mp_smooth_flux_()  ???:0
 564:  3 0x0000000000d2265f mo_couple_mp_couple_put_a2o_()  ???:0
 564:  4 0x0000000000d1e59a mo_couple_mp_couple_end_()  ???:0
 564:  5 0x0000000000b77915 MAIN__()  ???:0
 564:  6 0x0000000000415fa2 main()  ???:0
 564:  7 0x0000000000023493 __libc_start_main()  ???:0
 564:  8 0x0000000000415eae _start()  ???:0
 564: =================================
 518: ==== backtrace (tid:1370737) ====
 518:  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 518:  1 0x00000000012db578 mo_transpose_mp_gather_gp2_()  ???:0
 518:  2 0x0000000000d22d16 mo_couple_mp_smooth_flux_()  ???:0
 518:  3 0x0000000000d2265f mo_couple_mp_couple_put_a2o_()  ???:0
 518:  4 0x0000000000d1e59a mo_couple_mp_couple_end_()  ???:0
 518:  5 0x0000000000b77915 MAIN__()  ???:0
 518:  6 0x0000000000415fa2 main()  ???:0
 518:  7 0x0000000000023493 __libc_start_main()  ???:0
 518:  8 0x0000000000415eae _start()  ???:0
 518: =================================

...

1174: forrtl: error (78): process killed (SIGTERM)
1174: Image              PC                Routine            Line        Source             
1174: fesom              00000000007C582B  for__signal_handl     Unknown  Unknown
1174: libpthread-2.28.s  0000155549F62B20  Unknown               Unknown  Unknown
1174: libmpi.so.40.30.2  000015554A3B9F5E  mca_pml_ucx_recv      Unknown  Unknown
1174: libmpi.so.40.30.2  000015554A26A76A  ompi_coll_base_ba     Unknown  Unknown
1174: libmpi.so.40.30.2  000015554A20E331  PMPI_Barrier          Unknown  Unknown
1174: libmpi_mpifh.so.4  000015554A76A133  MPI_Barrier_f08       Unknown  Unknown
1174: fesom              0000000000435A55  Unknown               Unknown  Unknown
1174: fesom              000000000041DD86  Unknown               Unknown  Unknown
1174: fesom              0000000000414522  Unknown               Unknown  Unknown
1174: libc-2.28.so       0000155549BAE493  __libc_start_main     Unknown  Unknown
1174: fesom              000000000041442E  Unknown               Unknown  Unknown
2236: forrtl: error (78): process killed (SIGTERM)
2236: Image              PC                Routine            Line        Source             
2236: fesom              00000000007C582B  for__signal_handl     Unknown  Unknown
2236: libpthread-2.28.s  0000155549F62B20  Unknown               Unknown  Unknown
2236: libmpi.so.40.30.2  000015554A3B9F5E  mca_pml_ucx_recv      Unknown  Unknown
2236: libmpi.so.40.30.2  000015554A26A76A  ompi_coll_base_ba     Unknown  Unknown
2236: libmpi.so.40.30.2  000015554A20E331  PMPI_Barrier          Unknown  Unknown
2236: libmpi_mpifh.so.4  000015554A76A133  MPI_Barrier_f08       Unknown  Unknown
2236: fesom              0000000000435A55  Unknown               Unknown  Unknown
2236: fesom              000000000041DD86  Unknown               Unknown  Unknown
2236: fesom              0000000000414522  Unknown               Unknown  Unknown

antoniofis10000 commented 2 years ago

I think in the end I did not try it on the old machine (as we were not able to produce daily restart files with the old tools anyway).

Something related to the coupler?

mandresm commented 2 years ago

Looks like it. I'm trying with daily runs; let's see what happens.

pgierz commented 2 years ago

Just catching up here. If you need an OASIS insider, let me know; I spent some time working on that code.

If it helps, this here:

514: [l30020:1370733:0:1370733] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x178)

might be something like a hardware hiccup. Looks like the memory of that computer just suddenly vanished.

One possible idea to exclude hardware: does this configuration run nicely on, e.g. Ollie? Is Levante just being strange?

mandresm commented 2 years ago

Thanks for the info @pgierz. It happens very consistently at step 12 of a 16-day simulation, but it also happens at step 1 of a daily simulation. I am now testing with fesom-1.4 standalone. @antoniofis10000, can you maybe try in Ollie?

denizural commented 2 years ago

I am also getting signal 11 (segmentation fault) errors quite a lot on Levante recently.

antoniofis10000 commented 2 years ago

I will try in Ollie, but it will take me some time (I have to transfer the data from Levante and probably adapt some .yaml files).

mandresm commented 2 years ago

Before that, I would recommend that you run a cold run in Levante with 16-day restarts and see if you get the same problem. If you do, then you can try a cold run in Ollie; that should save you quite some time in terms of how much you have to transfer to Ollie.
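
In case it helps, a cold (initial) run can be requested from the runscript roughly as sketched below. The key names are assumptions on my side (per-model lresume flags plus the dates under general), so double-check them against a working AWI-CM1 runscript before relying on this:

    general:
        initial_date: 2017-01-01    # example dates for a 16-day cold test
        final_date: 2017-01-17
    echam:
        lresume: false              # cold start: do not read ECHAM restart files
    fesom:
        lresume: false              # cold start for FESOM as well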

antoniofis10000 commented 2 years ago

Ok, I will keep you posted.

antoniofis10000 commented 2 years ago

I think (as also happened in Mistral) it is not possible to do a cold run in Levante (at least for ssp370), as the model blows up (/work/ba1264/a270148/esm-experiments/AWICM/ColdRun).

mandresm commented 2 years ago

Fesom-1.4 and ECHAM-6.3.04p1 can run daily standalone simulations. I'm going to test a simpler AWICM1-CMIP6 setup now.

mandresm commented 2 years ago

I can also do daily restarts and a 16-day restart with AWICM1-CMIP6 in Levante.

mandresm commented 2 years ago

For the AWICM1 simulation with a 16-day restart you can check this experiment folder: /work/ab0995/a270152/awicm1cmip6_16days

antoniofis10000 commented 2 years ago

Two points: It is really surprising to me that you are able to run this cold run. I think that using GLOB and a different node distribution it blows up (/work/ba1264/a270148/esm-experiments/AWICM/PIControl).

In this simulation another significant difference is the use of restart file set to 1. This change does not seem to solve the problem (/work/ba1264/a270148/esm-experiments/AWICM/MiguelOption).

antoniofis10000 commented 2 years ago

FYI, I have the same issue when extending an old Mistral simulation by 6 months (/work/ba1264/a270148/esm-experiments/AWICM/E1HistFrom1018532448RestartTrue); in this case it stops after 3 months (the restart period in Mistral).

mandresm commented 2 years ago

> In this simulation another significant difference is the use of restart file set to 1. This change does not seem to solve the problem (/work/ba1264/a270148/esm-experiments/AWICM/MiguelOption).

1 is converted internally to true for the choose_ blocks, so it's basically the same thing. The recommendation is to use true or false instead of 0 and 1.
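
A short illustration of that recommendation (generic runscript keys shown only as an example; adapt to your setup):

    echam:
        lresume: true               # preferred: booleans, which is what the choose_ blocks switch on
    fesom:
        lresume: true               # avoid 0/1 here, even though 1 is converted internally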

nwieters commented 1 year ago

Hi @antoniofis10000,

is this issue already solved, or is it outdated?

If this is still an issue on Levante, I can try to reproduce it again. If not, can this issue be closed?

Thanks for your update to this.

antoniofis10000 commented 1 year ago

Hi @nwieters. No, this issue is not solved. Most of the time (but not always) I am not able to extend an experiment with AWI-CM1 by a longer run than the previous extension (e.g. if the experiment was previously extended by just one month, most of the following extensions need a restart every month; otherwise the model stops after one month without any apparent reason).

I think I have deleted all the failed experiments, but I can produce one. I will let you know when I have a path to look at.

Thanks! Best wishes, Antonio.

github-actions[bot] commented 2 months ago

This issue has been inactive for the last 365 days. It will now be marked as stale and closed after 30 days of further inactivity. Please add a comment to reset this automatic closing of this issue or close it if solved.