andrewgettelman opened 1 year ago
From Hugh Morrison:
The crash could simply be from not checking for a non-zero denominator in the nr conservation. This is explicitly checked for qr (I remember when that caused a crash and had to be added, this was like 15 years ago..) and qs. Interestingly, this is not done for any other microphysical state variables. It's possible that all the nr sink terms = 0 while dum > nr, and this never happens with the original code and therefore didn't cause a divide by 0. In which case we should add a check like we already are for qr. However, this may also be related to the inconsistency of using total rather than in-precip (in-cloud) values as noted above.
The emulator processes should definitely be included in the conservation check. I had assumed that the emulator process tendencies were stored simply as prc, pra, nprc, etc. and passed through the conservation checker the same as the kk rates. But it looks like this isn't the case with proc_rates%qrtend_TAU, proc_rates%qctend_TAU, etc.?
From Gagne:
It looks like nragg is 0 if qric < qsmall (1e-18) https://github.com/ESCOMP/PUMAS/blame/fbcb46a91b5b3f1e84c898a8b00f91e69b0fe4db/micro_pumas_utils.F90#L2166. Should that be changed? I am trying to run the code now without the nr ratio check and will see if things blow up in interesting ways…
The version I have been experimenting with is in /glade/work/dgagne/CAM, with modifications to PUMAS in /glade/work/dgagne/CAM/src/physics/pumas. I am running SCAM with the script at /glade/work/dgagne/scam_scripts/scam_mlcont_ported6.csh.
I have been using the PUMAS pumas_ml_tau_ported branch as the basis of my experiments (https://github.com/ESCOMP/PUMAS/tree/pumas_ml_tau_ported).
Greetings. I'll confirm what David John has said thus far:
I took a look at the current code, and the original unported code that ran.
I'm surprised this even worked!
I will start with rebuilding the TAU code to correct these issues (the ratios for one, I need a strategy for that). Then we can retrain, and work on the emulator code.
This may take me a few days since I am pretty busy.
@andrewgettelman - Feel free to either open a PR back to the https://github.com/ESCOMP/PUMAS/tree/pumas_ml_tau_ported branch or just commit/push your changes directly back to this branch if you feel comfortable doing either of these. I am not currently doing any cleanup, so we won't trample on each other. Alternatively point me to your source mods and I can bring them into this branch once you are done.
Question for the group, especially @cacraigucar and @Katetc:
I think we need to re-design the tendencies here to be consistent. I am thinking we either reuse:
nprc1, nprc, prc, pra, npra
or create new variables?
Only these variables are ratio checked; that is what is output and what is applied to the state. The diagnostics will have qctend_kk2000, qctend_TAU, etc. Thoughts?
Not sure this is relevant but reported by Brian Dobbins:
I did just try the same code that fails by default with the Intel 2022.1 compiler (switched from Intel 2019, the default), and it ran without a crash. Still in DEBUG mode, I literally just swapped the 'intel/19.1.1' entry in env_mach_specific.xml with 'intel/2022.1', and ran, by hand. Prior to this it crashed on step 881, now it's completed, though still gives warnings about significant water conservation errors.
Runs here: /glade/scratch/bdobbins/scam_ml_crash.arm97/run
Failed: /glade/scratch/bdobbins/scam_ml_crash.arm97/run/atm.log.9486121.chadmin1.ib0.cheyenne.ucar.edu.230418-144729
Success: /glade/scratch/bdobbins/scam_ml_crash.arm97/run/atm.log.230418-152621
I ran the latter by hand, interactively, hence the lack of a full name on the job.
So perhaps there are scientific and technical issues.
Okay, made an initial assessment. I'm going to start with Goal #1: get 'tau' working. We will probably want to retrain before we deal with the emulator code.
Steps:
Comments welcome on this plan.
@andrewgettelman I think that plan sounds reasonable. Getting TAU working with the right output variables and ratio checks makes a lot of sense as a first step. I imagine we will need new training data for the emulator after that. With a much better idea of the constraints going into the model, the retrained emulator should be both simpler and more performant.
As a way of joining the convo, a few naive questions:
Hi @mvanlierwalq...
- The in-cloud nc/qc/nr/qr quantities are the "prime" quantities listed in MG2008, not the double-prime, right?
Yes, should be.
- the original "unported" (I'm not sure what this means) code was both trained on grid-mean qc/nc/qr/nr, and also was run on the same grid-mean qc/nc/qr/nr, rather than in-cloud values. Is that right?
I think so: the original 'unported' code (sorry, 'ported' means ported to the newer version, so 'unported' is the old code from the ML paper, Gettelman et al. 2021) had the TAU bin operation occurring on in-cloud values. The emulator was operating on grid means, so we have to figure out if something was lost in translation. Always a problem when you are jumping back into it. There probably is a logic here I need to understand.
- Is cloud fraction an input to the emulator? sorry I should just look at your paper.
Not sure: @djgagne will have to answer this.
Adding myself to thread.
- Is cloud fraction an input to the emulator? sorry I should just look at your paper.
Cloud fraction and precipitation fraction were both inputs to the emulator in the original paper and in the updated emulator being tested. The emulator was blowing up pretty consistently before they were added, so the neural network did figure out how to use them approximately. I do think the emulator will have a much easier time learning the relationship once the in-cloud values are used for training.
Jumping in as well
replying to @cacraigucar
If I pull over all of the revised files in /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported2, will that capture all of your (and DJ's) changes?
Yes
One item I came across today is that there are a number of routines which don't appear to be used. Can the following be removed from module_neural_net.F90 (and I don't need to clean them up!): standard_scaler_transform, standard_scaler_inverse_transform, minmax_scaler_transform, minmax_scaler_inverse_transform, print_2d_array
from @djgagne replying to @cacraigucar
Regarding the extra functions in module neural network that aren’t currently being used: standard_scaler_transform and minmax_scaler_transform and their inverses are used for neural networks that have a different pre-processing procedure for their inputs and outputs. The latest emulator uses the quantile transform. They aren’t strictly necessary for this particular setup but may be useful to keep if other neural net emulators are used within CAM or CESM. I would recommend keeping them for now. print_2d_array is for debugging purposes and could be removed.
Just a quick update:
@andrewgettelman - Can you point me to your current code mods that I should use? I would like to commit to both the PUMAS branch as well as the CAM branch, so we are both working with the same code base.
Okay, that seems to complete and give me sensible answers in 3D. Not exactly the same (expected), but at least it looks like Earth (a sensible climate).
@cacraigucar - code is here: /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported2
@djgagne - I think I can now try to generate a run for training data. I will obviously output tendencies for qctend_TAU, nctend_TAU and nrtend_TAU.
What other variables should I provide as instantaneous output? Do you recall what you used in the last iteration and how frequently? Also, how much data went into it? If you refresh my memory, I can generate new training data.
Maybe it's best to check this code in so I can run from an actual tag/branch as Cheryl suggests....
Summary of the major issues fixed in the code:
I am glad to hear that the code changes appear to have fixed the problems!
For the last training simulation, I output instantaneous information every 23 hours for 2 years so that the sampling would step through the day. The files are located at /glade/p/cisl/aiml/dgagne/TAU_run6_lim10ppm for reference.
I am currently using the following input variables:
Will you save out the grid cell or in-cloud versions of QC, QR, NC, and NR? I would like to train on the in-cloud versions but could also derive them from the total grid cell versions (or derive total grid cell from in-cloud) as long as I have lcldm, precip_frac, and RHO_CLUBB.
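For reference, a minimal sketch of the grid-mean/in-cloud conversion being discussed, assuming cloud quantities scale with lcldm and rain quantities with precip_frac (the mincld floor and the *_incloud/*_inprecip names are illustrative, not the actual PUMAS variables):

! Illustrative only: grid-mean -> in-cloud (in-precip), with a floor on the
! fractions to avoid dividing by ~0 (mincld is an assumed small constant).
qc_incloud(i,k)  = qc(i,k)/max(lcldm(i,k), mincld)
nc_incloud(i,k)  = nc(i,k)/max(lcldm(i,k), mincld)
qr_inprecip(i,k) = qr(i,k)/max(precip_frac(i,k), mincld)
nr_inprecip(i,k) = nr(i,k)/max(precip_frac(i,k), mincld)
! ...and back to grid-mean by multiplying with the same fractions:
qc(i,k) = qc_incloud(i,k)*lcldm(i,k)
qr(i,k) = qr_inprecip(i,k)*precip_frac(i,k)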
I also have the following diagnostics in the output file: lamc, lamr, n0r, pgam. I don't think they are needed for the emulator at this point if you want to drop them, but they could be useful.
For outputs, I would want the following variables:
With the quantile transform method, I might be able to predict the tendencies directly, which would simplify things further. If I don't have to worry about directly conserving mass, then I can make a single neural net based on the tendencies and get the output "close enough" for the conservation checks to clip them.
Thanks @djgagne - I will check the code and see about generating the output. I think I can get all these variables to be in-cloud (input, output and tendencies) to be consistent. That's NOT quite how it is done now, so I am checking this.
If it is simpler to do the tendencies directly and clean them up later, I am happy to try that....
Just an update: in double checking everything I found a mistake in my ordering that was neglecting initial rain in the TAU code. This then uncovered a few other issues in my original implementation. I have it running in SCAM now, but it's crashing in the 3D code after 1 day or so. Still working on this: think I need to look at guardrails again and the original implementation.
Okay. I've validated my current version is BFB with the original KK2000. I also put in a conservation check in the code so the TAU collection code will not produce tendencies that result in negative number or mass.
However, the tendency from autoconversion and accretion can now be positive or negative.
@hmorrison100 - It seems like the 'ratio' conservation checks (e.g. for autoconversion) are assuming that all tendencies have only one sign, and that is now not the case with stochastic collection.
E.g.:
dum = ((prc(i,k)+pra(i,k)+mnuccc(i,k)+mnucct(i,k)+msacwi(i,k)+ &
psacws(i,k)+bergs(i,k)+qmultg(i,k)+psacwg(i,k)+pgsacw(i,k))*lcldm(i,k)+ &
berg(i,k))*deltat
if (dum.gt.qc(i,k)) then
If stochastic collection is the only thing operating and it increases nc (prc < 0) then dum < 0 and no conservation is done at this point as far as I can tell (no scaling of tendencies). I guess that is okay (since I have checked that there cannot be non-conservation from stochastic collection itself)....
Thoughts? SCAM runs a standard case, but the 3D model seems to be dying on the first time step.
Last comment of the day: if I 'clip' the stochastic collection tendencies to only be one sign, then the code runs through a month. So I think we might need to think about how to adjust the conservation checks to accommodate tendencies that can be positive or negative for 'autoconversion'.
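For concreteness, a minimal sketch of what such a one-sign clip could look like, assuming the TAU collection tendencies are carried in the usual prc/pra (mass) and nprc1/npra (number) variables with the checker's sink-positive sign convention; this is illustrative only, not the committed fix:

! Illustrative only: force stochastic-collection 'autoconversion' and
! 'accretion' to act purely as cloud water/number sinks so the existing
! one-signed conservation checks remain valid.
prc(i,k)   = max(prc(i,k),   0._r8)
pra(i,k)   = max(pra(i,k),   0._r8)
nprc1(i,k) = max(nprc1(i,k), 0._r8)
npra(i,k)  = max(npra(i,k),  0._r8)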
I have brought in Andrew's changes (as of today) and made commits to both the PUMAS branch and my CAM branch. To check them out:
git clone https://github.com/cacraigucar/CAM cam_ml_tau_ported
cd cam_ml_tau_ported
git checkout 9757d928
I will now proceed with implementing additional code cleanup and will give y'all the new hash when that code is committed
Hi @cacraigucar - I did a lot of development yesterday. The current version is /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported4 which has a few more modifications and gets the code back to BFB with 'kk2000'
This version now runs TAU about 6 days in 3-D before dying. Something in the state is going wrong that the conservation checkers are not catching, and then eventually CLUBB fails.
I have a new hash for my latest code commit. This contains Andrew's mods_ml_tau_ported4 files from mid-day today as well as cleanup that I've made.
The hash is 93eebd0.
I'm using the ne30pg3_ne30pg3_mg17 grid and compset F2000dev for my tests, and I am setting DEBUG=TRUE.
This code runs successfully for 9 time steps using "emulated" and the three files that I have from a while ago for the stochastic_emulated_filename... files. These are files in the /glade/work/dgagne/cam_mp_run6_quantile_nn directory.
Unfortunately my "tau" run crashes with a floating point exception on the first time step. The partial traceback is:
88:MPT: #1 0x00002afe1e234c66 in mpi_sgi_system (
88:MPT: #2 MPI_SGI_stacktraceback (
88:MPT: header=header@entry=0x7ffc08846f90 "MPT ERROR: Rank 88(g:88) received signal SIGFPE(8).\n\tProcess ID: 12846, Host: r8i7n1, Program: /glade/scratch/cacraig/test_camdev_ne30_try56_D_tau/bld/cesm.exe\n\tMPT Version: HPE MPT 2.25 08/14/21 03:"...) at sig.c:340
88:MPT: #3 0x00002afe1e234e66 in first_arriver_handler (signo=signo@entry=8,
88:MPT: stack_trace_sem=stack_trace_sem@entry=0x2afe2db40080) at sig.c:489
88:MPT: #4 0x00002afe1e2350f3 in slave_sig_handler (signo=8, siginfo=
The crash is the same statement that has been giving me trouble. The ratio calculation for rain number (nr):
ratio = (nr(i,k)*rdeltat+nprc(i,k)*lcldm(i,k))/precip_frac(i,k)/ &
(-nsubr(i,k)+npracs(i,k)+nnuccr(i,k)+nnuccri(i,k)-nragg(i,k)+npracg(i,k)+ngracs(i,k))*omsm
However, I have tried trapping for a failure here, and then I can get it to go farther, but it crashes in CLUBB (~200 timesteps for me).
Is there a way to set nr to 1e-12 if the predicted tendency pushes it below a certain threshold? When debugging before, I noticed that sometimes the updated nr value would be 1e-12 and sometimes it would be 0. I also noticed that the nr ratio check would produce divide by 0 errors because all the tendencies in the denominator ended up being 0. Is there an epsilon term we could add to the denominator to prevent those divide by 0 errors?
@djgagne @andrewgettelman Yes, we might want to add an epsilon to prevent division by zero for the Nr check. However, there seems to be another bug, see *** in the comments below.
Below are some comments after looking at the code.
NOTE: line numbers below refer to micro_pumas_v1.F90 in the "pumas_ml_tau_ported" branch.
- I see you added initialization to 0 for all proc_rates%qctend, etc. (this was not in the original code, but these pointers were not used, just the process rates directly). This should be fine, just something that I saw was new.
- Is there a limiter on proc_rates% coming out of the TAU tendency calculations? i.e., are they clipped to prevent the wrong sign? This could cause issues with the conservation checker otherwise.
- I don't think we should apply limits after sedimentation when warm_rain == 'tau' (lines ~4213-4240). This doesn't seem necessary just for TAU -- these checks are already taken care of earlier in the original code. The TAU/emulator part doesn't change anything after tendencies are added, i.e. before sedimentation, so it's not clear why any additional checks are needed at the end of the micro subroutine.
- ***But it's not just an unnecessary check, it may be doing harm, because there seems to be a conservation issue in the added checker --> we shouldn't remove mass if number goes to 0 but mass doesn't go to 0; this will violate mass conservation. This is done on lines 4214 and 4218. Again, mass/number checks aren't needed anyway, as this is already taken care of in the original code on lines 4206 and 4208.
- Also, this is not an effective radius for those limiters, it's a mean volume radius. At any rate, like I said, there are already size checks applied earlier in the original code: around lines 4141 to 4150 for rain (call size_dist_param_basic using the dummy Nr "dum_2D"), and around lines 4082 to 4100 for cloud.
- Same comments as above for the sign and size checks for TAU around lines 4500 to 4600 at the end of the micro routine.
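Returning to the epsilon suggestion above, a minimal sketch of one way to protect that Nr ratio (nrsinks and eps are illustrative names, not the actual variables in the code, and the skip branch is an assumption rather than the agreed-upon fix):

! Illustrative only: guard the Nr rescale against a zero denominator by
! summing the sinks first and skipping the rescale when there are none.
nrsinks = -nsubr(i,k)+npracs(i,k)+nnuccr(i,k)+nnuccri(i,k)-nragg(i,k)+ &
          npracg(i,k)+ngracs(i,k)
if (nrsinks > eps) then
   ratio = (nr(i,k)*rdeltat+nprc(i,k)*lcldm(i,k))/precip_frac(i,k)/nrsinks*omsm
else
   ratio = 1._r8   ! no Nr sinks to rescale
end if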
On an unrelated topic, I am bringing the KBARF file into PUMAS as @andrewgettelman suggested in an email discussion a couple of months ago. I would suggest that at the very least we rename this file to KBARF.txt (since it is an ASCII file). I would suggest two modifications and have one question:
Question: Should I create a /data directory and have this file reside in there?
I can make the changes based on the answers I receive. I do need answers to questions 1 and 2, as I don't know the answers myself.
Hi @cacraigucar - we went over this before. KBARF is the name used in the original code. We could call it KBARF_tau_kernel.txt.
I'd probably prefer to leave the file unmodified so we don't have to change the code or the file.
Is it small enough to put it in the code directory? Or does it need to go somewhere else?
@andrewgettelman Thanks for the new name for the file.
I won't add anything to file contents itself (though it wouldn't be hard to read a couple of lines of text before reading the actual contents).
Yes, I am planning on putting it into the PUMAS github repository since it is so small. I thought it might be nice to "hide" the file by creating a /data directory in PUMAS and putting the file in there. Other small "data" files that PUMAS might expand to needing in the future could also go in this directory. Otherwise, I can just place it in right beside all of the PUMAS code and not make a /data directory.
Let me know your thoughts.
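On the "read a couple of lines of text first" idea, a purely hypothetical sketch (the '#' comment convention and the pre-allocated kbarf array are assumptions; the real file format is whatever the TAU code already expects):

! Illustrative only: skip leading '#' comment lines, then read the kernel
! coefficients into a pre-allocated real array kbarf.
open(newunit=iunit, file='KBARF_tau_kernel.txt', status='old', action='read')
do
   read(iunit,'(a)') line
   if (line(1:1) /= '#') exit
end do
backspace(iunit)        ! back up to the first data line
read(iunit,*) kbarf     ! list-directed read of the coefficient array
close(iunit)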
Just a quick update: after discussions with @hmorrison100 I managed to get the code working, and TAU now runs through a month, producing reasonable values. I'll do some more checking and cleanup and try to get it ready by tomorrow for checking in something that works, and thinking about trying a training run for @djgagne . Not sure if I will get that done before leaving for travel tomorrow.
Thanks again @hmorrison100 !
@andrewgettelman - please point us to your latest code before you leave, if you can. I can work on cleaning it up if needed (removing print/write statements, etc.)
@cacraigucar for the moment let's just put the file in the PUMAS directory.
Where are you going to put a description if not in the file? In a README?
The description can read:
KBARF_tau_kernel.txt contains coefficients for the stochastic collection kernel used by the TAU stochastic collection code, invoked in PUMAS with micro_pumas_warm_rain = 'tau'
I've cleaned up my code a bit, verified it runs in SCAM, and am starting a 2 year test that should give training data for @djgagne
So even if this fails, it's probably as far as I am going to get this week, and you can tag it: /glade/u/home/andrew/cesm_ch/machlearn/mods_ml_tau_ported5
The code completed a year. It finished, but something happened with a restart file and it did not auto-restart. I'm off today for a week and don't have time to deal with that.
But I have 1 year of training data @djgagne if you want to look at it or take a first cut at it: /glade/scratch/andrew/archive/cam_ml_ported5_tau/atm/hist/h1
These are instantaneous fields, but we may have some issues with sub-stepping (I cannot recall how we shut off all substeps).
For now if it is not too much effort it might be good to do a quick update of the emulator just to see if I can get the code to run with it before we finalize and check everything in.
@andrewgettelman Are the input QC, QR, etc. fields in-cloud or for the whole grid cell? I can get the initial data processed through my scripts either today or Monday.
Thanks @djgagne - Everything (input, output and tendencies) is in-cloud now.
Here is a quick comparison of the new code climatology, compared to the 2021 paper version. Note the 'control' is very different now with PUMAS (control = KK2000). Thus this seems to be doing what we think it should be doing.
@djgagne if you have any questions on the data used for the emulator, let me know. If you get a new emulator file, I can test it (or I can point you to the code to integrate it). Thanks!
@cacraigucar, I'm not sure what we want to do with this now. I guess there is not a strong urge to move it into CAM until we get the emulator sorted out.
@andrewgettelman I met with Wayne earlier this week to go over the emulator training code and will work on getting an updated baseline emulator trained. I have had a large number of meetings and haven't had time to do the processing yet, but was at least able to take a quick look at and copy the CAM output files.
From David John Gagne:
Going through the PUMAS code, I’ve found a few more issues that may be affecting both TAU and the emulator. They appear to be fixable if we can get all of our ducks in a row.
On the machine learning side, I think it would make a lot of sense to re-train the emulator with in-cloud input values. Should I assume lcldm and precip_frac remain constant between the input and output qc, qr, nc, and nr when converting the calculations to in-cloud?