andrewgettelman opened 1 year ago
From Hugh Morrison:
The crash could simply be from not checking for a non-zero denominator in the nr conservation. This is explicitly checked for qr (I remember when that caused a crash and had to be added, this was like 15 years ago..) and qs. Interestingly, this is not done for any other microphysical state variables. It's possible that all the nr sink terms = 0 while dum > nr, and this never happens with the original code and therefore didn't cause a divide by 0. In which case we should add a check like we already are for qr. However, this may also be related to the inconsistency of using total rather than in-precip (in-cloud) values as noted above.
The emulator processes should definitely be included in the conservation check. I had assumed that the emulator process tendencies were stored simply as prc, pra, nprc, etc. and passed through the conservation checker the same as the kk rates. But it looks like this isn't the case with proc_rates%qrtend_TAU, proc_rates%qctend_TAU, etc.?
From Gagne:
It looks like nragg is 0 if qric < qsmall (1e-18) https://github.com/ESCOMP/PUMAS/blame/fbcb46a91b5b3f1e84c898a8b00f91e69b0fe4db/micro_pumas_utils.F90#L2166. Should that be changed? I am trying to run the code now without the nr ratio check and will see if things blow up in interesting ways…
The version I have been experimenting with is in /glade/work/dgagne/CAM, with modifications to PUMAS in /glade/work/dgagne/CAM/src/physics/pumas. I am running SCAM with the script at /glade/work/dgagne/scam_scripts/scam_mlcont_ported6.csh.
I have been using the PUMAS pumas_ml_tau_ported branch as the basis of my experiments (https://github.com/ESCOMP/PUMAS/tree/pumas_ml_tau_ported).
Greetings. I'll confirm what David John has said thus far:
I took a look at the current code, and the original unported code that ran.
I'm surprised this even worked!
I will start with rebuilding the TAU code to correct these issues (the ratios for one, I need a strategy for that). Then we can retrain, and work on the emulator code.
This may take me a few days since I am pretty busy.
@andrewgettelman - Feel free to either open a PR back to the https://github.com/ESCOMP/PUMAS/tree/pumas_ml_tau_ported branch or just commit/push your changes directly back to this branch if you feel comfortable doing either of these. I am not currently doing any cleanup, so we won't trample on each other. Alternatively point me to your source mods and I can bring them into this branch once you are done.
Question for the group, especially @cacraigucar and @Katetc:
I think we need to re-design the tendencies here to be consistent. I am thinking we either reuse:
nprc1, nprc, prc, pra, npra
or create new variables?
Only these variables are ratio checked; that is what is output and what is applied to the state. The diagnostics will have qctend_kk2000, qctend_TAU, etc. Thoughts?
Not sure this is relevant but reported by Brian Dobbins:
I did just try the same code that fails by default with the Intel 2022.1 compiler (switched from Intel 2019, the default), and it ran without a crash. Still in DEBUG mode, I literally just swapped the 'intel/19.1.1' entry in env_mach_specific.xml with 'intel/2022.1', and ran, by hand. Prior to this it crashed on step 881, now it's completed, though still gives warnings about significant water conservation errors.
Runs here: /glade/scratch/bdobbins/scam_ml_crash.arm97/run
Failed: /glade/scratch/bdobbins/scam_ml_crash.arm97/run/atm.log.9486121.chadmin1.ib0.cheyenne.ucar.edu.230418-144729
Success: /glade/scratch/bdobbins/scam_ml_crash.arm97/run/atm.log.230418-152621
I ran the latter by hand, interactively, hence the lack of a full name on the job.
So perhaps there are scientific and technical issues.
Okay, made an initial assessment. I'm going to start with Goal #1: get 'tau' working. We will probably want to retrain before we deal with the emulator code.
Steps:
Comments welcome on this plan.
@andrewgettelman I think that plan sounds reasonable. Getting TAU working with the right output variables and ratio checks makes a lot of sense as a first step. I imagine we will need new training data for the emulator after that. With a much better idea of the constraints going into the model, the retrained emulator should be both simpler and more performant.
As a way of joining the convo, a few naive questions:
Hi @mvanlierwalq...
- The in-cloud nc/qc/nr/qr quantities are the "prime" quantities listed in MG2008, not the double-prime, right?
Yes, should be.
- the original "unported" (I'm not sure what this means) code was both trained on grid-mean qc/nc/qr/nr, and also was run on the same grid-mean qc/nc/qr/nr, rather than in-cloud values. Is that right?
I think so: the original 'unported' code (sorry, 'ported' means ported to the newer version, so 'unported' is the old code from the ML paper, Gettelman et al. 2021) had the TAU bin operation occurring on in-cloud values. The emulator was operating on grid means, so we have to figure out if something was lost in translation. Always a problem when you are jumping back into it. There probably is a logic here I need to understand.
- Is cloud fraction an input to the emulator? sorry I should just look at your paper.
Not sure: @djgagne will have to answer this.
Adding myself to thread.
- Is cloud fraction an input to the emulator? sorry I should just look at your paper.
Cloud fraction and precipitation fraction were both inputs to the emulator in the original paper and in the updated emulator being tested. The emulator was blowing up pretty consistently before they were added, so the neural network did figure out how to use them approximately. I do think the emulator will have a much easier time learning the relationship once the in-cloud values are used for training.
Jumping in as well
replying to @cacraigucar
If I pull over all of the revised files in /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported2, will that capture all of your (and DJ's) changes?
Yes
One item I came across today is that there are a number of routines which don't appear to be used. Can the following be removed from module_neural_net.F90 (and I don't need to clean them up!): standard_scaler_transform, standard_scaler_inverse_transform, minmax_scaler_transform, minmax_scaler_inverse_transform, print_2d_array
from @djgagne replying to @cacraigucar
Regarding the extra functions in module neural network that aren’t currently being used: standard_scaler_transform and minmax_scaler_transform and their inverses are used for neural networks that have a different pre-processing procedure for their inputs and outputs. The latest emulator uses the quantile transform. They aren’t strictly necessary for this particular setup but may be useful to keep if other neural net emulators are used within CAM or CESM. I would recommend keeping them for now. print_2d_array is for debugging purposes and could be removed.
Just a quick update:
@andrewgettelman - Can you point me to your current code mods that I should use? I would like to commit to both the PUMAS branch as well as the CAM branch, so we are both working with the same code base.
Okay, that seems to complete and give me sensible answers in 3D. Not exactly the same (expected), but at least it looks like Earth (a sensible climate).
@cacraigucar - code is here: /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported2
@djgagne - I think I can now try to generate a run for training data. I will obviously output tendencies for qctend_TAU, nctend_TAU and nrtend_TAU.
What other variables should I provide as instantaneous output? Do you recall what you used in the last iteration and how frequently? Also, how much data went into it? If you refresh my memory, I can generate new training data.
Maybe it's best to check this code in so I can run from an actual tag/branch as Cheryl suggests....
Summary of the major issues fixed in the code:
I am glad to hear that the code changes appear to have fixed the problems!
For the last training simulation, I output instantaneous information every 23 hours for 2 years so that the sampling would step through the day. The files are located at /glade/p/cisl/aiml/dgagne/TAU_run6_lim10ppm for reference.
I am currently using the following input variables:
Will you save out the grid cell or in-cloud versions of QC, QR, NC, and NR? I would like to train on the in-cloud versions but could also derive them from the total grid cell versions (or derive total grid cell from in-cloud) as long as I have lcldm, precip_frac, and RHO_CLUBB.
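For reference, a minimal sketch of the grid-mean/in-cloud conversion being discussed, assuming cloud quantities scale with lcldm and rain quantities with precip_frac (the mincld floor and the *_incloud/*_inprecip names are illustrative, not the actual PUMAS variables):

! Illustrative only: grid-mean -> in-cloud (in-precip), with a floor on the
! fractions to avoid dividing by ~0 (mincld is an assumed small constant).
qc_incloud(i,k)  = qc(i,k)/max(lcldm(i,k), mincld)
nc_incloud(i,k)  = nc(i,k)/max(lcldm(i,k), mincld)
qr_inprecip(i,k) = qr(i,k)/max(precip_frac(i,k), mincld)
nr_inprecip(i,k) = nr(i,k)/max(precip_frac(i,k), mincld)
! ...and back to grid-mean by multiplying with the same fractions:
qc(i,k) = qc_incloud(i,k)*lcldm(i,k)
qr(i,k) = qr_inprecip(i,k)*precip_frac(i,k)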
I also have the following diagnostics in the output file: lamc, lamr, n0r, pgam. I don't think they are needed for the emulator at this point if you want to drop them, but they could be useful.
For outputs, I would want the following variables:
With the quantile transform method, I might be able to predict the tendencies directly, which would simplify things further. If I don't have to worry about directly conserving mass, then I can make a single neural net based on the tendencies and get the output "close enough" for the conservation checks to clip them.
Thanks @djgagne - I will check the code and see about generating the output. I think I can get all these variables to be in-cloud (input, output and tendencies) to be consistent. That's NOT quite how it is done now, so I am checking this.
If it is simpler to do the tendencies directly and clean them up later, I am happy to try that....
Just an update: in double checking everything I found a mistake in my ordering that was neglecting initial rain in the TAU code. This then uncovered a few other issues in my original implementation. I have it running in SCAM now, but it's crashing in the 3D code after 1 day or so. Still working on this: think I need to look at guardrails again and the original implementation.
Okay. I've validated my current version is BFB with the original KK2000. I also put in a conservation check in the code so the TAU collection code will not produce tendencies that result in negative number or mass.
However, the tendency from autoconversion and accretion can now be positive or negative.
@hmorrison100 - It seems like the 'ratio' conservation checks (e.g. for autoconversion) are assuming that all tendencies have only one sign, and that is now not the case with stochastic collection.
E.g.:
dum = ((prc(i,k)+pra(i,k)+mnuccc(i,k)+mnucct(i,k)+msacwi(i,k)+ &
psacws(i,k)+bergs(i,k)+qmultg(i,k)+psacwg(i,k)+pgsacw(i,k))*lcldm(i,k)+ &
berg(i,k))*deltat
if (dum.gt.qc(i,k)) then
If stochastic collection is the only thing operating and it increases nc (prc < 0) then dum < 0 and no conservation is done at this point as far as I can tell (no scaling of tendencies). I guess that is okay (since I have checked that there cannot be non-conservation from stochastic collection itself)....
Thoughts? SCAM runs a standard case, but the 3D model seems to be dying on the first time step.
Last comment of the day: if I 'clip' the stochastic collection tendencies to only be one sign, then the code runs through a month. So I think we might need to think about how to adjust the conservation checks to accommodate tendencies that can be positive or negative for 'autoconversion'.
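For concreteness, a minimal sketch of what such a one-sign clip could look like, assuming the TAU collection tendencies are carried in the usual prc/pra (mass) and nprc1/npra (number) variables with the checker's sink-positive sign convention; this is illustrative only, not the committed fix:

! Illustrative only: force stochastic-collection 'autoconversion' and
! 'accretion' to act purely as cloud water/number sinks so the existing
! one-signed conservation checks remain valid.
prc(i,k)   = max(prc(i,k),   0._r8)
pra(i,k)   = max(pra(i,k),   0._r8)
nprc1(i,k) = max(nprc1(i,k), 0._r8)
npra(i,k)  = max(npra(i,k),  0._r8)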
I have brought in Andrew's changes (as of today) and made commits to both the PUMAS branch and my CAM branch. To check them out:
git clone https://github.com/cacraigucar/CAM cam_ml_tau_ported
cd cam_ml_tau_ported
git checkout 9757d928
I will now proceed with implementing additional code cleanup and will give y'all the new hash when that code is committed
Hi @cacraigucar - I did a lot of development yesterday. The current version is /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported4 which has a few more modifications and gets the code back to BFB with 'kk2000'
This version now runs TAU about 6 days in 3-D before dying. Something in the state is going wrong that the conservation checkers are not catching, and then eventually CLUBB fails.
I have a new hash for my latest code commit. This contains Andrew's mods_ml_tau_ported4 files from mid-day today as well as cleanup that I've made.
The hash is 93eebd0.
I'm using the ne30pg3_ne30pg3_mg17 grid and compset F2000dev for my tests, and I am setting DEBUG=TRUE.
This code runs successfully for 9 time steps using "emulated" and the three files that I have from a while ago for the stochastic_emulated_filename... files. These are files in the /glade/work/dgagne/cam_mp_run6_quantile_nn directory.
Unfortunately my "tau" run crashes with a floating point exception on the first time step. The partial traceback is:
88:MPT: #1 0x00002afe1e234c66 in mpi_sgi_system (
88:MPT: #2 MPI_SGI_stacktraceback (
88:MPT: header=header@entry=0x7ffc08846f90 "MPT ERROR: Rank 88(g:88) received signal SIGFPE(8).\n\tProcess ID: 12846, Host: r8i7n1, Program: /glade/scratch/cacraig/test_camdev_ne30_try56_D_tau/bld/cesm.exe\n\tMPT Version: HPE MPT 2.25 08/14/21 03:"...) at sig.c:340
88:MPT: #3 0x00002afe1e234e66 in first_arriver_handler (signo=signo@entry=8,
88:MPT: stack_trace_sem=stack_trace_sem@entry=0x2afe2db40080) at sig.c:489
88:MPT: #4 0x00002afe1e2350f3 in slave_sig_handler (signo=8, siginfo=
The crash is the same statement that has been giving me trouble. The ratio calculation for rain number (nr):
ratio = (nr(i,k)*rdeltat+nprc(i,k)*lcldm(i,k))/precip_frac(i,k)/ &
(-nsubr(i,k)+npracs(i,k)+nnuccr(i,k)+nnuccri(i,k)-nragg(i,k)+npracg(i,k)+ngracs(i,k))*omsm
However, I have tried trapping for a failure here, and then I can get it to go farther, but it crashes in CLUBB (~200 timesteps for me).
Is there a way to set nr to 1e-12 if the predicted tendency pushes it below a certain threshold? When debugging before, I noticed that sometimes the updated nr value would be 1e-12 and sometimes it would be 0. I also noticed that the nr ratio check would produce divide by 0 errors because all the tendencies in the denominator ended up being 0. Is there an epsilon term we could add to the denominator to prevent those divide by 0 errors?
@djgagne @andrewgettelman Yes, we might want to add an epsilon to prevent division by zero for the Nr check. However, there seems to be another bug, see *** in the comments below.
Below are some comments after looking at the code.
NOTE: line numbers below refer to micro_pumas_v1.F90 in the "pumas_ml_tau_ported" branch.
- I see you added initialization to 0 for all proc_rates%qctend, etc. (this was not in the original code, but these pointers were not used, just the process rates directly). This should be fine, just something that I saw was new.
- Is there a limiter on proc_rates% coming out of the TAU tendency calculations? i.e., are they clipped to prevent the wrong sign? This could cause issues with the conservation checker otherwise.
- I don't think we should apply limits after sedimentation when warm_rain == 'tau' (lines ~4213-4240). This doesn't seem necessary just for TAU -- these checks are already taken care of earlier in the original code. The TAU/emulator part doesn't change anything after tendencies are added, i.e. before sedimentation, so it's not clear why any additional checks are needed at the end of the micro subroutine.
- ***But it's not just an unnecessary check, it may be doing harm, because there seems to be a conservation issue in the added checker --> we shouldn't remove mass if number goes to 0 but mass doesn't go to 0; this will violate mass conservation. This is done on lines 4214 and 4218. Again, mass/number checks aren't needed anyway, as this is already taken care of in the original code on lines 4206 and 4208.
- Also, this is not an effective radius for those limiters, it's a mean volume radius. At any rate, like I said, there are already size checks applied earlier in the original code: around lines 4141 to 4150 for rain (call size_dist_param_basic using the dummy Nr "dum_2D"), and around lines 4082 to 4100 for cloud.
- Same comments as above for the sign and size checks for TAU around lines 4500 to 4600 at the end of the micro routine.
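Returning to the epsilon suggestion above, a minimal sketch of one way to protect that Nr ratio (nrsinks and eps are illustrative names, not the actual variables in the code, and the skip branch is an assumption rather than the agreed-upon fix):

! Illustrative only: guard the Nr rescale against a zero denominator by
! summing the sinks first and skipping the rescale when there are none.
nrsinks = -nsubr(i,k)+npracs(i,k)+nnuccr(i,k)+nnuccri(i,k)-nragg(i,k)+ &
          npracg(i,k)+ngracs(i,k)
if (nrsinks > eps) then
   ratio = (nr(i,k)*rdeltat+nprc(i,k)*lcldm(i,k))/precip_frac(i,k)/nrsinks*omsm
else
   ratio = 1._r8   ! no Nr sinks to rescale
end if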
On an unrelated topic, I am bringing the KBARF file into PUMAS as @andrewgettelman suggested in an email discussion a couple of months ago. I would suggest that at the very least we rename this file to KBARF.txt (since it is an ASCII file). I would suggest two modifications and have one question:
Question: Should I create a /data directory and have this file reside in there?
I can make the changes based on the answers I receive. I do need answers to questions 1 and 2, as I don't know the answers myself.
Hi @cacraigucar - we went over this before. KBARF is the name used in the original code. We could call it KBARF_tau_kernel.txt.
I'd probably prefer to leave the file unmodified so we don't have to change the code or the file.
Is it small enough to put it in the code directory? Or does it need to go somewhere else?
@andrewgettelman Thanks for the new name for the file.
I won't add anything to file contents itself (though it wouldn't be hard to read a couple of lines of text before reading the actual contents).
Yes, I am planning on putting it into the PUMAS github repository since it is so small. I thought it might be nice to "hide" the file by creating a /data directory in PUMAS and putting the file in there. Other small "data" files that PUMAS might expand to needing in the future could also go in this directory. Otherwise, I can just place it in right beside all of the PUMAS code and not make a /data directory.
Let me know your thoughts.
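On the "read a couple of lines of text first" idea, a purely hypothetical sketch (the '#' comment convention and the pre-allocated kbarf array are assumptions; the real file format is whatever the TAU code already expects):

! Illustrative only: skip leading '#' comment lines, then read the kernel
! coefficients into a pre-allocated real array kbarf.
open(newunit=iunit, file='KBARF_tau_kernel.txt', status='old', action='read')
do
   read(iunit,'(a)') line
   if (line(1:1) /= '#') exit
end do
backspace(iunit)        ! back up to the first data line
read(iunit,*) kbarf     ! list-directed read of the coefficient array
close(iunit)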
Just a quick update: after discussions with @hmorrison100 I managed to get the code working, and TAU now runs through a month, producing reasonable values. I'll do some more checking and cleanup and try to get it ready by tomorrow for checking in something that works, and thinking about trying a training run for @djgagne . Not sure if I will get that done before leaving for travel tomorrow.
Thanks again @hmorrison100 !
@andrewgettelman - please point us to your latest code before you leave, if you can. I can work on cleaning it up if needed (removing print/write statements, etc.)
@cacraigucar for the moment let's just put the file in the PUMAS directory.
Where are you going to put a description if not in the file? In a README?
The description can read:
KBARF_tau_kernel.txt contains coefficients for the stochastic collection kernel used by the TAU stochastic collection code, invoked in PUMAS with micro_pumas_warm_rain = 'tau'
I've cleaned up my code a bit, verified it runs in SCAM, and am starting a 2 year test that should give training data for @djgagne
So even if this fails, it's probably as far as I am going to get this week, and you can tag it: /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported5
The code completed a year. It finished, but something happened with a restart file and it did not auto-restart. I'm off today for a week and don't have time to deal with that.
But I have 1 year of training data @djgagne if you want to look at it or take a first cut at it: /glade/scratch/andrew/archive/cam_ml_ported5_tau/atm/hist/h1
These are instantaneous fields, but we may have some issues with sub-stepping (I cannot recall how we shut off all substeps).
For now if it is not too much effort it might be good to do a quick update of the emulator just to see if I can get the code to run with it before we finalize and check everything in.
@andrewgettelman Are the input QC, QR, etc. fields in-cloud or for the whole grid cell? I can get the initial data processed through my scripts either today or Monday.
Thanks @djgagne - Everything (input, output and tendencies) is in-cloud now.
Here is a quick comparison of the new code climatology, compared to the 2021 paper version. Note the 'control' is very different now with PUMAS (control = KK2000). Thus this seems to be doing what we think it should be doing.
@djgagne if you have any questions on the data used for the emulator, let me know. If you get a new emulator file, I can test it (or I can point you to the code to integrate it). Thanks!
@cacraigucar, I'm not sure what we want to do with this now. I guess there is not a strong urge to move it into CAM until we get the emulator sorted out.
@andrewgettelman I met with Wayne earlier this week to go over the emulator training code and will work on getting an updated baseline emulator trained. I have had a large number of meetings and haven't had time to do the processing yet, but was at least able to take a quick look at and copy the CAM output files.
From David John Gagne:
Going through the PUMAS code, I’ve found a few more issues that may be affecting both TAU and the emulator. They appear to be fixable if we can get all of our ducks in a row.
On the machine learning side, I think it would make a lot of sense to re-train the emulator with in-cloud input values. Should I assume lcldm and precip_frac remain constant between the input and output qc, qr, nc, and nr when converting the calculations to in-cloud?