j34ni opened 4 years ago
@j34ni This should certainly be tracked down. I wonder
Is it a CAM or NorESM problem: that is hard to say. I only made short runs, and the problem was easy to spot after only a few months because there were already significant differences, for instance in the sea ice (which do not occur with prescribed SSTs). I have asked Ada about what happened for CMIP and how they diagnosed the issues at the time.
Can it be circumvented: probably, but would it not be more sensible to find a more permanent solution, since this bug is likely to have other consequences that are not yet understood.
Only tried on Fram since I did not have CPU time on Vilje.
Not bit identical from the first month, yes, and some differences are already clearly visible (sea ice fraction, and probably other variables as well).
Here are some comments:
Suggestion from Thomas's email: @tto061 @j34ni @DirkOlivie
I ran an NF2000climo compset with and without additional all-zero emission files, and the results also differ!
OK, thanks Jean. So we've ruled out sea ice. Do you think you can try test #2? Also, could you share your NorESM case directories and point to your NorESM root directory for these tests on Fram?
I have not done this particular test on Fram but on a virtual machine, with the same run-time environment (same compiler version, same libraries, etc.), without a batch system or queuing time (and also with fewer computational resources).
Let me know if you want to look at particular files and I will put them somewhere accessible to you.
As for CESM and the F2000climo compset, I ran it several times under similar conditions (f19 resolution) and it never crashed. Also, it does not give different results when adding extra emission files containing zeros.
I forgot to mention that I did all the CESM tests with the latest release (cesm2.1.3). Is it worth trying older versions, or should we focus on NorESM?
I believe we should just test the newest NorESM-CAM6-Nor without "coupling" to other components. My suspicion is that it is related to the emissions read in CAM, in combination with some other feature of the aerosol or CAM-Nor code.
A test could be to see whether NF2000climo (CAM6-Nor) with the MAM4 aerosol scheme can be run. @DirkOlivie, is that possible? It would be interesting anyway.
It seems to me that there are several problems which may or may not be related: i) intermittent NorESM crashes (occurrence of NaNs and INFs), ii) non-bit-for-bit reproducibility, and iii) issues when reading the emission files.
The NF2000climo compset and the more recent CAM6-Nor compsets impose the use of the CAM-Oslo aerosol scheme (an essential part of the compset definition). Have the frc2 compsets been tested in this context? Has a test been done without any emissions?
If no one else does, I can check whether adding zeros when reading in existing files matters, i.e. before the numbers are scattered to the chunks.
Hi Dirk & Øyvind et al
I'm using them routinely on tetralith, without any problems, but only after I switched from CLM50%BGC-CROP to CLM50%SP; my error may have looked more like what Øyvind is getting (a crash with NaNs on land points).
Never tried without (or zero) emissions.
Cheers Thomas
Hi
I can elaborate a bit more on my comment above. The check that can be done is to add an extra input sector at the point where the input files are read in, but instead of reading in a file of zeros, simply define the input array to be zero. The purpose would be to check whether it is the read-in process itself that causes the problem, or whether it is the definition of new sectors. If the results are still different, the addition of zero can be done further down in the physics structure.
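To make that check concrete, here is a minimal, self-contained sketch of the idea; all names (`read_emis_sector`, `emis_in`, `n_sectors`) are illustrative placeholders, not the actual CAM-Nor routines or variables:

```fortran
! Sketch only: one extra "sector" is defined as zero in memory, bypassing the
! netCDF read entirely. If results still change, the read-in path is not the culprit.
program zero_sector_check
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: ncol = 4, nlev = 3
  integer :: n_sectors, i
  real(r8), allocatable :: emis_in(:,:,:)

  n_sectors = 2 + 1                              ! the usual sectors plus one extra
  allocate(emis_in(ncol, nlev, n_sectors))

  do i = 1, n_sectors - 1
     call read_emis_sector(i, emis_in(:,:,i))    ! stands in for the real file read
  end do

  emis_in(:,:,n_sectors) = 0._r8                 ! extra sector set to zero, no file involved

  print *, 'total emissions including the extra zero sector:', sum(emis_in)

contains

  subroutine read_emis_sector(isec, field)
    ! Dummy stand-in for the real read-in routine, only to make the sketch runnable.
    integer,  intent(in)  :: isec
    real(r8), intent(out) :: field(:,:)
    field = real(isec, r8)
  end subroutine read_emis_sector

end program zero_sector_check
```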
Update from 22/10/2019: when executing the NFPTAERO60 compset with grid f19_f19_mg17, I was getting strange values for some field names, which leads to a crash at least when I compile with MPI+OpenMP.
I printed the following block from the file ndrop.F90, around line 2172:
```fortran
tendencyCounted(:) = .FALSE.
do m = 1, ntot_amode
   do l = 1, nspec_amode(m)
      mm   = mam_idx(m,l)
      lptr = getTracerIndex(m,l,.false.)
      if (.NOT. tendencyCounted(lptr)) then
         print*, mm, fieldname(mm), 'ndrop'
         call outfld(fieldname(mm),    coltend(:,lptr),    pcols, lchnk)
         call outfld(fieldname_cw(mm), coltend_cw(:,lptr), pcols, lchnk)
         tendencyCounted(lptr) = .TRUE.
      endif
   end do
end do
```
I get:
```
8 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
12 BC_AI_mixnuc1 ndrop
13 OM_AI_mixnuc1 ndrop
15 SO4_A2_mixnuc1 ndrop
18 SO4_PR_mixnuc1 ndrop
19 BC_AC_mixnuc1 ndrop
20 OM_AC_mixnuc1 ndrop
22 SO4_AC_mixnuc1 ndrop
26 DST_A2_mixnuc1 ndrop
34 DST_A3_mixnuc1 ndrop
35 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
```
When I printed fieldname(mm) for mm = 8, 14, 35 at line 263 in ndrop.F90, it appears it is never assigned any value or initialized.
Second, it could be that the loop should not run over these indices at all. Could you please check and update?
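If it helps, a guard could be added around the outfld calls to flag entries of fieldname that were never assigned. The snippet below is a self-contained toy illustration of such a check, not the actual ndrop.F90 data structures:

```fortran
! Toy illustration: detect field names that were never assigned (blank or
! starting with a NUL byte) before they would be passed to outfld.
program check_fieldnames
  implicit none
  integer, parameter :: nfields = 4
  character(len=24) :: fieldname(nfields)
  integer :: mm

  fieldname(:) = achar(0)              ! mimic memory that was never initialised
  fieldname(2) = 'BC_AI_mixnuc1'
  fieldname(3) = 'OM_AI_mixnuc1'

  do mm = 1, nfields
     if (len_trim(fieldname(mm)) == 0 .or. iachar(fieldname(mm)(1:1)) == 0) then
        print *, 'fieldname(', mm, ') was never assigned -- would skip outfld here'
        cycle
     end if
     print *, mm, trim(fieldname(mm)), ' ndrop'
  end do
end program check_fieldnames
```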
Further adding to the picture: as far as I can tell, none of my integrations on tetralith, including the NFHIST cases for CMIP6, are bit-for-bit reproducible with default compiler options (i.e. -O2 for Fortran), either from existing restarts or from default initial conditions. I have not run a reproducibility test with the -O0 option.
I am investigating the bug with different tools (like the Intel Inspector) for memory and thread checking and debugging. I think I am getting there. That now seems to work on a virtual machine.
@j34ni A temporary solution might be to use a single 3D SO2 emission file containing the standard 3D emissions plus the Pinatubo eruption emissions. Would you like me to create such a file?
@DirkOlivie We can give it a go
@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 I eventually got NorESM working in the Conda environment (with a GNU compiler) and have not managed to make it crash yet!
There may be something very wrong with the Intel 2018 compiler, as was already the case when I was running the Variable Resolution CESM (for which I ended up using Intel 2019).
@j34ni Did you / could you explain how one can run NorESM in a Conda environment? Is that in the NorESM2 documentation already? (I mean, that's really interesting to have!)
@j34ni Really great news that you can run NorESM in a Conda environment. It is going to be interesting to see scaling results.
@MichaelSchulzMETNO At the moment this has not been documented much; it is still work in progress, building on the "conda cesm" recipe. That was mainly used for teaching purposes (to learn how to run an ESM on Galaxy) and for development (without having to wait in a queue). However, a proper "conda noresm" will be made available soon; it will allow a simple installation and contain everything needed to run the model (including configuration files, the Math Kernel Library instead of BLAS/LAPACK, etc.), first on generic platforms and later on an HPC system.
@oyvindseland Yes, we will have to evaluate the scalability on an HPC system (so far this has only been used with small configurations on single-node virtual machines); Betzy comes at the perfect time...
@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 Some of the problems occur at the very beginning of a run: initialization issues (obviously) but also non-BFB reproducibility and even crashes due to NaNs or INFs.
To test that quickly:
I did that many times with CESM, and the 3 simulations systematically provide identical results.
So far with NorESM that has only worked with the GNU (e.g. 9.3.0) and Intel (2019.5) compilers, not with Intel 2018, whether or not it makes use of Alok's SourceMods.
That is not meant to replace a long run, but it is much faster for evaluating the effect of various fixes: if the 3 simulations do not provide identical results after one time step, there is no need to waste more resources. However, if they do provide identical results, the simulation can still fail later.
Adding emission files significantly affects the reproducibility of the simulations, even if they only contain zeros (whether these zeros are defined as float or as double precision).
This is very likely linked to the problem that occurred in the summer of 2019 (while carrying out the CMIP6 runs), although at that time it made the model crash more or less randomly, which is not the case here.
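For what it's worth, adding a zero to a finite value is bitwise neutral in IEEE arithmetic, even when the zero was stored in single precision and promoted to double, so the sensitivity to the all-zero emission files presumably comes from something other than the added numbers themselves (changed memory layout, operation ordering, or uninitialised data are obvious suspects, though that is only a guess). A tiny self-contained illustration (not NorESM code):

```fortran
! Minimal check (not NorESM code): adding a zero emission value, even one stored
! as single precision and promoted to double, leaves a finite value unchanged.
program zero_add_check
  implicit none
  integer, parameter :: r4 = selected_real_kind(6), r8 = selected_real_kind(12)
  real(r4) :: zero_sp
  real(r8) :: x, y

  zero_sp = 0.0_r4
  x = 1.234567890123456e-3_r8
  y = x + real(zero_sp, r8)      ! "emission" of zero, float promoted to double

  print *, 'value unchanged by adding zero: ', x == y   ! .true. for any finite x
end program zero_add_check
```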