E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
334 stars 334 forks source link

Implement topounits and downscaling in E3SM #4273

Closed tktesfa2014 closed 2 years ago

tktesfa2014 commented 3 years ago

Implementation of topography-based subgrid structure (topounits) and downscaling methods of atmospheric forcing from atmosphere grid to land topounits. It allows variable number of topounits per grid improving computational demand as compared to having the same number of topounits per grid for all grids. Downscaling methods include two approaches of precipitation downscaling developed at PNNL. Also, this branch includes downscaling methods for other atmospheric forcing developed at the University of Arizona. In addition, topounits have been implemented into the dynamic subgrid code and data structures of recent E3SM features such as soil erosion, irrigation and plant hydraulics.

[BFB]

rljacob commented 3 years ago

Does this replace #3864 ?

rljacob commented 3 years ago

Is this BFB with current watercycle cases?

tktesfa2014 commented 3 years ago

@rljacob : yes this replaces #3864.

tktesfa2014 commented 3 years ago

@rljacob: If the downscaling is turned on, it is not BFB.

rljacob commented 3 years ago

Please add a test that will turn on the downscaling.

tktesfa2014 commented 3 years ago

Thanks @rljacob . Okay I am working on it.

singhbalwinder commented 2 years ago

@tktesfa2014 : I have added a new test for topounit in the land integration suite. The surface data file to go with the test has been uploaded to the input data server.

tktesfa2014 commented 2 years ago

Thanks @singhbalwinder ! @rljacob a test has been added.

rljacob commented 2 years ago

I don't understand these messages from github. 457d797c - Merge branch 'tktesfa2014/lnd/archv2_disag_grd_to_topo_rebase2' of github.com:E3SM-Project/E3SM into tktesfa2014/lnd/archv2_disag_grd_to_topo_rebase2 what are you trying to do?

bishtgautam commented 2 years ago

Adding @singhbalwinder to review the PR as it modifies EAM.

tktesfa2014 commented 2 years ago

@rljacob I had to do git pull after Balwinder added an SMS test; and then pushed my commits to Github. That is all I did.

tktesfa2014 commented 2 years ago

@bishtgautam the changes to EAM have already been reviewed by @singhbalwinder in the earlier PR and I have not made any changes to the EAM after that.

rljacob commented 2 years ago

@tktesfa2014 ok. When you do "git pull" add "--ff-only". Its doing a merge commit to update your local copy of the branch but we want those to be fast-forward merges.

tktesfa2014 commented 2 years ago

Thank you @rljacob ! Okay.

tktesfa2014 commented 2 years ago

@bishtgautam I have addressed all your comments. Thanks!

rljacob commented 2 years ago

Status: will be rebased now that OpenACC cleanup is in master.

bishtgautam commented 2 years ago

@peterdschwartz Do you have any comments for this PR? If not, I plan to merge this to next today (provided next is open).

peterdschwartz commented 2 years ago

@bishtgautam @tktesfa2014 Encountered a bug in a case not covered by the e3sm_land_developer test suites. I had issues with my Compy account the past couple of weeks, but I should be able to look into it and provide a fix this week.

rljacob commented 2 years ago

note: @tktesfa2014 is doing more tests with the bug fix.

bishtgautam commented 2 years ago

@peterdschwartz With your latest commit, is this PR ready to be merged?

peterdschwartz commented 2 years ago

@tktesfa2014 My commit added veg_pp@wttopounit to subgridRestMod as discussed. Additionallly, when testing on chrysalis I noticed that the topounit test was failing due to threading in initGridcellsMod. I fixed that by putting the new topounit parts into a subroutine called set_topounit rather than making the new variables threadprivate in the directive.

I was able to get all e3sm_land_developer tests to pass or diff as expected with the latest baselines so it should be ready to go.

peterdschwartz commented 2 years ago

@bishtgautam The PR should be ready to be merged -- just let me know if there are any questions or issues with the perennial crop additions.

bishtgautam commented 2 years ago

The PR has been merged to next (ad88227). I made an error in the commit message (PR-4273 instead of PR #4273) because of which the merge info regarding this PR to next does not show up here.

rljacob commented 2 years ago

Overnight testing is showing diffs in all cases with EAM and ELM. Was this PR tested for BFB for F-cases?

bishtgautam commented 2 years ago

@tktesfa2014 @peterdschwartz Can you answer @rljacob's question?

bishtgautam commented 2 years ago

@rljacob The ELM tests are passing (e.g. https://my.cdash.org/viewTest.php?onlypassed&buildid=2057098)

rljacob commented 2 years ago

Yes but the mess in anything with EAM and ELM makes me think no one looked at those cases with this PR.

thorntonpe commented 2 years ago

Checking with Teklu on this...

tktesfa2014 commented 2 years ago

Yes, I did test runs using FC5AV1C-L --res ne30_g16 to test the implementation of new subgrid structure. But, with downscaling turned on I do not expect it to be BFB.

tktesfa2014 commented 2 years ago

@rljacob Yes, I did test runs using FC5AV1C-L --res ne30_g16 to test the implementation of new subgrid structure. But, with downscaling turned on I do not expect it to be BFB.

rljacob commented 2 years ago

After discussion on Slack, we determined the "uovern" field has different values even with downscaling off. So this PR has been reverted from next because it causes diffs in every coupler history file. Those diffs are harmless (since that field isn't used) but its a lot for so close to the v2 tag.

I suggest you add logic so that the field is not changed at all unless downscaling is on.

tktesfa2014 commented 2 years ago

Thanks @rljacob! I see your point. If the downscaling is not on, we do not need to calculate uovern; so I think introducing a downscaling flag to eam will solve the problem. What do you think? Thanks!

rljacob commented 2 years ago

You already have use_atm_downscaling_to_topunit so just use that. I think the problem is that it wasn't set to any value before. So even setting it to zero will cause diffs. You could just add an if statement around the one line you added to atm_import_export. You could also put a check around the call to calc_uovern

tktesfa2014 commented 2 years ago

@rljacob okay, thanks!

tktesfa2014 commented 2 years ago

@rljacob I introduced the changes as suggested but eam is not able to see the flag that is defined on elm side in elm_varctl.F90 . I am getting the following error message: Check INCLUDE paths. [ELM_VARCTL] use elm_varctl , only: use_atm_downscaling_to_topunit

Which one of the following is best?

  1. adding elm_varctl.F90 to include path
  2. defining use_atm_downscaling_to_topunit on eam side as well. if option 2, then could you direct me to the eam file equivalent to elm_varctl.F90? Or, if option 1, where should I add the path? Or, any tips? Thanks!
rljacob commented 2 years ago

Sorry I thought use_atm_downscaling_to_topunit was in the eam namelist. I see now its in elm.

Definitely not option 1. Are you familiar with the infobuffer? Its a way for the components to share simple flags and constants.

tktesfa2014 commented 2 years ago

@rljacob No I am not familiar with the infobuffer.

tktesfa2014 commented 2 years ago

@rljacob could you please point me to a file that has simple flags shared by components? I think if there is a way to share flags between components that seems to be the best option. Another option is to define a separate flag on the EAM side. Please let me know your thoughts. Thanks!

rljacob commented 2 years ago

In elm, look for calls to seq_infodata_PutData in cpl/lnd_comp_mct.F90 (it was infodata, not infobuffer). You'll need to add a new logical to seq_infodata_type in driver-mct/shr/seq_infodata_mod.F90 which will be used to hold the value of use_atm_downscaling_to_topunit to you set by adding a PutData call. You can then add a similar logical to EAM (not sure where) and get its value from the infodata with a call to seq_infodata_GetData in eam/src/cpl/atm_comp_mct.F90

tktesfa2014 commented 2 years ago

Thanks @rljacob!

tktesfa2014 commented 2 years ago

@rljacob and @bishtgautam with the help of @singhbalwinder we have now the downscaling flag implemented through the infodata. You can proceed with the merging. Thank you all!

rljacob commented 2 years ago

There are unexpected run-time fails in single-column tests: SMS_Ln5.ne4_ne4.F-MMFXX-SCM-ARM97.chrysalis_intel SMS_R_Ld5.ne4_ne4.FSCM5A97.chrysalis_intel.eam-scm

I did notice some mods around single-column related code.

The error is from an av copy in driver-mct/main/prep_lnd_mod.F90:358 call mct_aVect_copy(aVin=a2x_l, aVout=x2l_l, vector=mct_usevector, sharedIndices=a2x_SharedIndices)

I think it means there's an attribute missing in the aVin.

tktesfa2014 commented 2 years ago

@rljacob and @bishtgautam I will check and fix this. Thanks!

rljacob commented 2 years ago

I tried reproducing the fail on the branch and couldn't. You might need to do a test merge to next to see it.

tktesfa2014 commented 2 years ago

@rljacob thank you for the information.

peterdschwartz commented 2 years ago

@tktesfa2014 @rljacob I'm not sure if a fix has been found yet, but from doing a few runs on different machines with SMS_R_Ld5.ne4_ne4.FSCM5A97.chrysalis_intel.eam-scm, the memory corruption that causes the error in prep_lnd_mod appears to be due to the changes in SatellitePhenologyMod.F90. Specifically, with the reading of monthly vegetation data routine. (replacing this module with the one from the master branch allows this test to pass) I'm not really familiar with SCM runs so I'm not sure how that's related, but maybe the extra topounit dimension is causing problems when reading from fsurdat.

rljacob commented 2 years ago

@bogensch might know.

rljacob commented 2 years ago

memory corruption? I thought, based on the MCT error message, that there was some logical problem that probably involved how attributes are added for topo-on vs. topo-off.

peterdschwartz commented 2 years ago

@rljacob I printed out the shared indices between the atm and lnd component in prep_lnd_mod. They are contained in a local, save variable that is only written to on the first time step. The indices look valid for the first and second timestep, but on the 3rd timestep it is filled with junk that causes the segmentation fault. Since no routine writes to that variable after the first timestep, I am interpreting that as something writing to memory it shouldn't be somewhere else.

Running in debug on compy with valgrind actually gave a clearer backtrace that led me to SatellitePhenologyMod.

singhbalwinder commented 2 years ago

I am also looking into this issue. SCM test passes with master and not with this branch (the test crashes while reading MONTHLY_HEIGHT_BOT variable from the file.), which suggests that the additional max_topounit dimension is causing this issue. Which is indeed the case as removing it, the code gets across the failing point. I also noticed that (like @peterdschwartz mentioned), the values read from the input file by this test are different from the values this test reads, when run with master branch. Which clearly suggests that the test is reading invalid values when run using this branch.

I also talked with @jayeshkrishna yesterday and he suggested to implement a code block to get start and count for lsmpft dimension in ncdio_pio.F90. I implemented that but got into some other runtime errors. I am still looking into it.

Another confusing part is that SMS test runs fine with this branch which uses the same code (with max_topounits) and reads the same input file. There may be other difference between SMS and SCM tests but one difference is the number of columns. SCM uses just one column while SMS columns are based on the model resolution.

I think one simple fix would be to allocate arrays differently when max_topounits variable is 1 vs. greater than 1. This solution would involve some code changes due to the special treatment for cases where max_topounits is 1.