j34ni opened 4 years ago
@j34ni This should certainly be tracked down. I wonder
Is it a CAM or NorESM problem: that is hard to say. I only made short runs, and the problem was easy to spot after only a few months because there were already significant differences, for instance in the sea ice (which do not occur with prescribed SSTs). I have asked Ada about what happened for CMIP and how they diagnosed the issues at the time.
Can it be circumvented: probably, but would it not be more sensible to find a more permanent solution, since this bug is likely to have other consequences that are not yet understood.
Only tried on Fram since I did not have CPU time on Vilje.
Not bit identical from the first month, yes, and some differences are already clearly visible (sea ice fraction, and probably other variables as well).
Here are some comments:
Suggestion from Thomas's email: @tto061 @j34ni @DirkOlivie
I ran an NF2000climo compset with and without additional all-zero emission files, and the results also differ!
OK, thanks Jean. So we've ruled out sea ice. Do you think you can try test #2? Also, could you share your NorESM case directories and point to your NorESM root directory for these tests on Fram?
I have not done this particular test on Fram but on a virtual machine, with the same run-time environment (same compiler version, same libraries, etc.), without a batch system or queuing time (and also with fewer computational resources).
Let me know if you want to look at particular files and I will put them somewhere accessible to you.
As for CESM and the F2000climo compset, I ran it several times under similar conditions (f19 resolution) and it never crashed. Also, it does not give different results when adding extra emission files containing zeros.
I forgot to mention that I did all the CESM tests with the latest release (cesm2.1.3). Is it worth trying older versions, or should we focus on NorESM?
I believe we should just test the newest NorESM-CAM6-Nor without "coupling" to other components. My suspicion is that it is related to the emissions read in CAM, in combination with some other feature of the aerosol or CAM-Nor code.
A test could be to see whether NF2000climo (CAM6-Nor) with the MAM4 aerosol scheme can be run. @DirkOlivie, is that possible? It would be interesting anyway.
It seems to me that there are several problems which may or may not be related: i) intermittent NorESM crashes (occurrence of NaNs and INFs), ii) non-bit-for-bit reproducibility, and iii) issues when reading the emission files.
The NF2000climo compset and the more recent CAM6-Nor compsets impose the use of the CAM-Oslo aerosol scheme (an essential part of the compset definition). Have the frc2 compsets been tested in this context? Has a test been done without any emissions?
If no one else does, I can check whether adding zeros when reading in existing files matters, i.e. before the numbers are scattered to the chunks.
Hi Dirk & Øyvind et al
I'm using them routinely on tetralith, without any problems, but only after I switched from CLM50%BGC-CROP to CLM50%SP; my error may have looked more like what Øyvind is getting (a crash with NaNs on land points).
Never tried without (or zero) emissions.
Cheers Thomas
Hi
I can elaborate a bit more on my comment above. The check that can be done is to add an extra input sector at the point where the input files are read in, but instead of reading in a file of zeros, simply define the input array to be zero. The purpose would be to check whether it is the read-in process itself that causes the problem, or whether it is the definition of new sectors. If the results are still different, the addition of zero can be done further down in the physics structure.
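To make that check concrete, here is a minimal, self-contained sketch of the idea; all names (`read_emis_sector`, `emis_in`, `n_sectors`) are illustrative placeholders, not the actual CAM-Nor routines or variables:

```fortran
! Sketch only: one extra "sector" is defined as zero in memory, bypassing the
! netCDF read entirely. If results still change, the read-in path is not the culprit.
program zero_sector_check
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: ncol = 4, nlev = 3
  integer :: n_sectors, i
  real(r8), allocatable :: emis_in(:,:,:)

  n_sectors = 2 + 1                              ! the usual sectors plus one extra
  allocate(emis_in(ncol, nlev, n_sectors))

  do i = 1, n_sectors - 1
     call read_emis_sector(i, emis_in(:,:,i))    ! stands in for the real file read
  end do

  emis_in(:,:,n_sectors) = 0._r8                 ! extra sector set to zero, no file involved

  print *, 'total emissions including the extra zero sector:', sum(emis_in)

contains

  subroutine read_emis_sector(isec, field)
    ! Dummy stand-in for the real read-in routine, only to make the sketch runnable.
    integer,  intent(in)  :: isec
    real(r8), intent(out) :: field(:,:)
    field = real(isec, r8)
  end subroutine read_emis_sector

end program zero_sector_check
```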
Update from 22/10/2019: when executing the NFPTAERO60 compset with grid f19_f19_mg17, I was getting strange values for some field names, which leads to a crash at least when I compile with MPI+OpenMP.
I printed the following block from the file ndrop.F90, around line 2172:
```fortran
tendencyCounted(:) = .FALSE.
do m = 1, ntot_amode
   do l = 1, nspec_amode(m)
      mm   = mam_idx(m,l)
      lptr = getTracerIndex(m,l,.false.)
      if (.NOT. tendencyCounted(lptr)) then
         print*, mm, fieldname(mm), 'ndrop'
         call outfld(fieldname(mm),    coltend(:,lptr),    pcols, lchnk)
         call outfld(fieldname_cw(mm), coltend_cw(:,lptr), pcols, lchnk)
         tendencyCounted(lptr) = .TRUE.
      endif
   end do
end do
```
I get:
```
8 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
12 BC_AI_mixnuc1 ndrop
13 OM_AI_mixnuc1 ndrop
15 SO4_A2_mixnuc1 ndrop
18 SO4_PR_mixnuc1 ndrop
19 BC_AC_mixnuc1 ndrop
20 OM_AC_mixnuc1 ndrop
22 SO4_AC_mixnuc1 ndrop
26 DST_A2_mixnuc1 ndrop
34 DST_A3_mixnuc1 ndrop
35 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
```
When I printed fieldname(mm) for mm = 8, 14, 35 at line 263 in ndrop.F90, it appears it is never assigned any value or initialized.
Second, it could be that the loop should not run over these indices at all. Could you please check and update?
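If it helps, a guard could be added around the outfld calls to flag entries of fieldname that were never assigned. The snippet below is a self-contained toy illustration of such a check, not the actual ndrop.F90 data structures:

```fortran
! Toy illustration: detect field names that were never assigned (blank or
! starting with a NUL byte) before they would be passed to outfld.
program check_fieldnames
  implicit none
  integer, parameter :: nfields = 4
  character(len=24) :: fieldname(nfields)
  integer :: mm

  fieldname(:) = achar(0)              ! mimic memory that was never initialised
  fieldname(2) = 'BC_AI_mixnuc1'
  fieldname(3) = 'OM_AI_mixnuc1'

  do mm = 1, nfields
     if (len_trim(fieldname(mm)) == 0 .or. iachar(fieldname(mm)(1:1)) == 0) then
        print *, 'fieldname(', mm, ') was never assigned -- would skip outfld here'
        cycle
     end if
     print *, mm, trim(fieldname(mm)), ' ndrop'
  end do
end program check_fieldnames
```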
Further adding to the picture: as far as I can tell, none of my integrations on tetralith, including the NFHIST cases for CMIP6, are bit-for-bit reproducible with default compiler options (i.e. -O2 for Fortran), either from existing restarts or from default initial conditions. I have not run a reproducibility test with the -O0 option.
I am investigating the bug with different tools (like the Intel Inspector) for memory and thread checking and debugging. I think I am getting there. That now seems to work on a virtual machine.
@j34ni A temporary solution might be to use a single 3D SO2 emission file containing the standard 3D emissions plus the Pinatubo eruption emissions. Would you like me to create such a file?
@DirkOlivie We can give it a go
@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 I eventually got NorESM working in the Conda environment (with a GNU compiler) and have not managed to make it crash yet!
There may be something very wrong with the Intel 2018 compiler, as was already the case when I was running the Variable Resolution CESM (for which I ended up using Intel 2019).
@j34ni Did you / could you explain how one can run NorESM in a Conda environment? Is that in the NorESM2 documentation already? (I mean, that's really interesting to have!)
@j34ni Really great news that you can run NorESM in a Conda environment. It is going to be interesting to see scaling results.
@MichaelSchulzMETNO At the moment this has not been documented much; it is still work in progress, building on the "conda cesm" recipe. That was mainly used for teaching purposes (to learn how to run an ESM on Galaxy) and for development (without having to wait in a queue). However, a proper "conda noresm" will be made available soon; it will allow a simple installation and contain everything needed to run the model (including configuration files, the Math Kernel Library instead of BLAS/LAPACK, etc.), first on generic platforms and later on an HPC system.
@oyvindseland Yes, we will have to evaluate the scalability on an HPC system (so far this has only been used with small configurations on single-node virtual machines); Betzy comes at the perfect time...
@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 Some of the problems occur at the very beginning of a run: initialization issues (obviously) but also non-BFB reproducibility and even crashes due to NaNs or INFs.
To test that quickly:
I did that many times with CESM, and the 3 simulations systematically provide identical results.
So far with NorESM that has only worked with the GNU (e.g. 9.3.0) and Intel (2019.5) compilers, not with Intel 2018, whether or not it makes use of Alok's SourceMods.
That is not meant to replace a long run, but it is much faster for evaluating the effect of various fixes: if the 3 simulations do not provide identical results after one time step, there is no need to waste more resources. However, if they do provide identical results, the simulation can still fail later.
Adding emission files significantly affects the reproducibility of the simulations, even if they only contain zeros (whether these zeros are defined as float or as double precision).
This is very likely linked to the problem that occurred in the summer of 2019 (while carrying out the CMIP6 runs), although at that time it made the model crash more or less randomly, which is not the case here.
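For what it's worth, adding a zero to a finite value is bitwise neutral in IEEE arithmetic, even when the zero was stored in single precision and promoted to double, so the sensitivity to the all-zero emission files presumably comes from something other than the added numbers themselves (changed memory layout, operation ordering, or uninitialised data are obvious suspects, though that is only a guess). A tiny self-contained illustration (not NorESM code):

```fortran
! Minimal check (not NorESM code): adding a zero emission value, even one stored
! as single precision and promoted to double, leaves a finite value unchanged.
program zero_add_check
  implicit none
  integer, parameter :: r4 = selected_real_kind(6), r8 = selected_real_kind(12)
  real(r4) :: zero_sp
  real(r8) :: x, y

  zero_sp = 0.0_r4
  x = 1.234567890123456e-3_r8
  y = x + real(zero_sp, r8)      ! "emission" of zero, float promoted to double

  print *, 'value unchanged by adding zero: ', x == y   ! .true. for any finite x
end program zero_add_check
```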