ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

CTSM-FATES timing/cost #1076

Open dlawrenncar opened 4 years ago

dlawrenncar commented 4 years ago

@rosiealice @ckoven @wwieder @ekluzek Following up on conversations we had about CTSM-FATES costs at the LMWG meeting. The new CSL allocation request is going to need to be written this summer, and this would be a good opportunity to try to establish at least general FATES costs, relative to big-leaf CTSM, for the purpose of writing the proposal. I'm aware that the costs will be more variable through time, but even a ballpark estimate would be helpful, perhaps for the SP version and the full competition BGC version. I'd be happy to explore this myself, which would give me the opportunity to run FATES for the first time. Perhaps @rosiealice could point me to two relevant cases (SP, BGC full competition) to start with.

Definition of done:

ckoven commented 4 years ago

@dlawrenncar if it helps, my historical transient runs took 20 minutes per decade at 4x5 resolution (so ~1000 gridcells, of which ~750 had living vegetation) on 288 cores on cheyenne.

output from that run is here: /glade/scratch/charlie/archive/fates_clm50_global_4x5_historicaltransient_2e3f469f_2905a9ba if you're interested.

ckoven commented 4 years ago

sorry, 20 minutes per year.
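
For comparison with the pe-hrs/yr and yrs/day figures quoted later in this thread, the corrected number can be converted with simple arithmetic. This is a rough sketch that assumes cost is cores times wall-clock hours; the model's timing files may charge by reserved node rather than by core, so treat the result as approximate. The function names are illustrative only.

```python
def pe_hours_per_year(cores: int, minutes_per_sim_year: float) -> float:
    """Core-hours consumed per simulated year (assumes every core is charged)."""
    return cores * minutes_per_sim_year / 60.0


def sim_years_per_day(minutes_per_sim_year: float) -> float:
    """Simulated years completed in 24 wall-clock hours."""
    return 24 * 60 / minutes_per_sim_year


# Figures from the comments above: 288 cores, 20 minutes per simulated year.
cost = pe_hours_per_year(cores=288, minutes_per_sim_year=20)
throughput = sim_years_per_day(minutes_per_sim_year=20)
print(f"{cost:.0f} pe-hrs/yr, {throughput:.0f} yrs/day")  # 96 pe-hrs/yr, 72 yrs/day
```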

dlawrenncar commented 4 years ago

Thanks, I was going to try to establish costs for 1 and 2 deg simulations, presuming that 4x5 resolution is (a) cheap enough as to mainly be in the noise and (b) mainly for development/debugging purposes and not production.

wwieder commented 4 years ago

here's a recent 4x5 no crop with timing files for comparison

/glade/scratch/wwieder/clm5_4x5_woodCN_cont_spin/run/timing

ekluzek commented 4 years ago

@dlawrenncar is going to work on this. Note that you need to make sure that FATES is spun up well for your timing tests. Unlike big-leaf CTSM, FATES is likely to change in timing cost as the model runs.

rosiealice commented 4 years ago

Hi @dlawrenncar , I set these 4x5 simulations off before I saw your comment on the 1/2 deg simulations.

FATES Fixed biogeog with competition with 6 PFTs. 40y /glade/scratch/rfisher/archive/fates_timin_6pft_fbg_comp/lnd/hist

CLM5 BGC no crop. 40y. /glade/scratch/rfisher/archive/clm5_timing_bgc/lnd/hist

Broadly, the CLM simulation took 3 hours, and the FATES one took 18. It might be slightly longer given the spinup issues Eric mentioned. I can take a look at the latter years to check (lots to do today before they turn the computer off!!)

One interesting thing would be to look at how the FATES aggregation parameters affect the speed (because they should, especially for patch dynamics)...

dlawrenncar commented 4 years ago

OK. Looking at the timing files, I see:

FATES Fixed biogeog with competition with 6 PFTs

/glade/work/rfisher/git/ctsmjuly20/cime/scripts/fates_timin_6pft_fbg_comp/timing

82 pe-hrs/yr (74 in first submission), 53 yrs/day

CLM BGC no crop

/glade/work/rfisher/git/ctsmjuly20/cime/scripts/clm5_timing_bgc/timing

14 pe-hrs/yr, 303 yrs/day

So, as @rosiealice showed, the cost is ~6x. Obviously, that's a lot. I took a quick look through the timing files to see if I could see the source of the cost increase and I guess it will not be a surprise that my quick look suggests that the cost increase is mostly embedded within canflux. I guess this is simply due to the larger number of calls to photosynthesis due to the much larger number of patches compared to pfts. So, probably this is going to be hard to reduce without reducing the cost of the canopy flux calculations.
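
As a quick check on the ~6x figure, the ratios can be computed from the numbers quoted in this thread. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Values quoted above from the 4x5 timing files.
fates = {"pe_hrs_per_yr": 82.0, "yrs_per_day": 53.0}
clm_bgc = {"pe_hrs_per_yr": 14.0, "yrs_per_day": 303.0}

# Cost ratio: how many more core-hours FATES needs per simulated year.
cost_ratio = fates["pe_hrs_per_yr"] / clm_bgc["pe_hrs_per_yr"]

# Throughput ratio: how many times faster big-leaf CLM advances per day.
throughput_ratio = clm_bgc["yrs_per_day"] / fates["yrs_per_day"]

# Wall-clock ratio from the 40-year runs (18 hours vs 3 hours).
wallclock_ratio = 18 / 3

print(f"cost: {cost_ratio:.1f}x, throughput: {throughput_ratio:.1f}x, "
      f"wall clock: {wallclock_ratio:.1f}x")
```

All three measures land between about 5.7x and 6x, so they tell the same story.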

At some stage, I will check again at 2deg resolution for the purpose of writing the next CSL allocation proposal. We should also have a conversation at some point soon about costs since this large increase in costs is going to significantly impact computational resource planning and will have implications for implementation in CESM3.

ckoven commented 4 years ago

I'm actually a bit surprised that the number of patches should be greater than the number of PFTs. FATES currently limits it to 10 primary patches per site, which seems like less than 6x the number of PFTs in big-leaf CLM?

rgknox commented 4 years ago

For any given column, we seem to be operating with our 1600 cohorts each requiring their own photosynthesis calculations, a scheme which is embedded in the canflux iterative solve. So only 6X, yay.

dlawrenncar commented 4 years ago

Sorry, yeah, it is the much larger number of cohorts compared to CTSM PFTs that is the likely source, not the potentially larger number of patches. So, with much larger number of cohorts compared to PFTs, only 6x, yay! ... but still 6x, boo!

rgknox commented 4 years ago

For accuracy's sake, I just want to point out that we actually perform photosynthesis on each leaf-layer of each canopy-layer, and since we use PPA, each pft has its own leaf layer (sorry, I shouldn't have said cohort). So we end up doing photosynthesis on an array that is pft x leaf-layer x canopy-layer. In order to make sure we hit all the possible layers, we DO loop by cohort, but we also do a lot of masking to avoid redundancy.

See here, we actually loop over leaf layers: https://github.com/NGEET/fates/blob/master/biogeophys/FatesPlantRespPhotosynthMod.F90#L375

That said, I think that reducing our number of leaf layers is something that could potentially reduce computation, and may be low hanging fruit.
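
The looping scheme described above can be sketched roughly as follows. This is a toy illustration and not the FATES code: the cohort tuples, the dimension sizes, and the counter standing in for the photosynthesis solve are all hypothetical. It shows how looping by cohort while masking bins that are already solved keeps the number of solves bounded by the pft x leaf-layer x canopy-layer array size rather than by the cohort count.

```python
def count_photosynthesis_solves(cohorts, n_leaf_layers):
    """Loop over cohorts, but solve each (canopy layer, pft, leaf layer) bin once.

    cohorts: iterable of (pft, canopy_layer, leaf_layers_occupied) tuples.
    """
    solved = set()  # mask of bins that have already been computed
    n_solves = 0
    for pft, canopy_layer, leaf_layers_occupied in cohorts:
        for leaf_layer in range(min(leaf_layers_occupied, n_leaf_layers)):
            key = (canopy_layer, pft, leaf_layer)
            if key in solved:
                continue  # another cohort already covered this bin
            solved.add(key)
            n_solves += 1  # stand-in for the actual photosynthesis solve
    return n_solves


# Hypothetical patch: 100 cohorts spread over 2 PFTs and 2 canopy layers.
cohorts = [(i % 2, (i // 2) % 2, 1 + i % 5) for i in range(100)]

n = count_photosynthesis_solves(cohorts, n_leaf_layers=5)
n_fewer_layers = count_photosynthesis_solves(cohorts, n_leaf_layers=3)
print(n, n_fewer_layers)  # 20 12
```

In this toy setup the 100 cohorts trigger only 2 x 2 x 5 = 20 solves, and capping the leaf layers at 3 cuts that to 12, which is the kind of saving the leaf-layer suggestion points at.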

rgknox commented 4 years ago

There is also this old issue here: https://github.com/NGEET/fates/issues/386

rosiealice commented 4 years ago

Also for accuracy's sake, while 10 is the maximum number of patches, if the fusion criteria were on the fussy end, and every site had the max number of patches, that would be inefficient. I meant to check the distribution of NPATCHES but all the computer resources are down today because of Wyoming-related things.

The number of all calculations (photosynthesis, cohorts etc.) scales with NPATCHES, but as @rgknox mentioned, the correspondence between the number of cohorts and the number of photosynthesis calculations is more complicated...

rosiealice commented 4 years ago

Anthony Walker (who isn't on this repo) also had mentioned in the past that he was working on an analytical method for speeding up the photosynthesis routines. If we could do that, it might be pretty useful. I'll ask him...

dlawrenncar commented 4 years ago

I ran 2deg and 4x5 FATES fixed biogeography and got timing costs for 1850 Control

2deg: ~600 pe-hrs/yr (compared to 75 pe-hrs/yr CLM5BGC) = 8x

4x5: ~110 pe-hrs/yr (compared to 14 pe-hrs/yr CLM5BGC) = 8x
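
For allocation planning, these per-year costs turn into a budget by multiplying by the number of simulated years. A minimal sketch of that arithmetic follows; the 100-year run length is purely illustrative and does not come from this thread, and the names are hypothetical.

```python
# Approximate costs quoted above, in pe-hrs per simulated year.
COSTS_PE_HRS_PER_YR = {
    "2deg": {"fates": 600.0, "clm5bgc": 75.0},
    "4x5": {"fates": 110.0, "clm5bgc": 14.0},
}


def total_core_hours(resolution: str, model: str, sim_years: int) -> float:
    """Total core-hours for a run, assuming a constant per-year cost."""
    return COSTS_PE_HRS_PER_YR[resolution][model] * sim_years


HYPOTHETICAL_YEARS = 100  # illustrative run length, not from the thread

for resolution, costs in COSTS_PE_HRS_PER_YR.items():
    ratio = costs["fates"] / costs["clm5bgc"]
    budget = total_core_hours(resolution, "fates", HYPOTHETICAL_YEARS)
    print(f"{resolution}: FATES/CLM5BGC = {ratio:.1f}x, "
          f"{budget:.0f} core-hours per {HYPOTHETICAL_YEARS} FATES years")
```

The constant-cost assumption ignores the first few years of a run, before the cost settles.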

In both cases, cost comes into 'equilibrium' after about 4-6 years of simulation.

Cases:

/glade/work/dlawren/cases/timing_tests/ctsm50fates_2deg_timing_fixed_biogeog

/glade/work/dlawren/cases/timing_tests/ctsm50fates_4x5_timing_fixedbiogeog

Would be good to discuss at forthcoming CTSM-FATES meeting.

ckoven commented 4 years ago

it seems like one place to start on this is to put timing calls around every instance where CTSM calls FATES code, so that we can better understand exactly where the costs in FATES are being incurred?
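
The idea of wrapping each entry point in a named timer can be sketched as below. This is a Python analogue for illustration only: in CTSM the instrumentation would be written in Fortran with the model's own timing library, and the two `fates_*` functions here are hypothetical stand-ins, not real FATES routines.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timers = defaultdict(float)  # accumulated wall-clock seconds per label
calls = defaultdict(int)     # number of timed calls per label


@contextmanager
def timed(label):
    """Accumulate the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timers[label] += time.perf_counter() - start
        calls[label] += 1


def fates_dynamics():
    """Hypothetical stand-in for a cheap FATES entry point."""
    sum(i * i for i in range(10_000))


def fates_canopy_fluxes():
    """Hypothetical stand-in for an expensive FATES entry point."""
    sum(i * i for i in range(50_000))


for _ in range(3):  # three pretend model time steps
    with timed("fates_dynamics"):
        fates_dynamics()
    with timed("fates_canopy_fluxes"):
        fates_canopy_fluxes()

# Report the most expensive labels first.
for label, seconds in sorted(timers.items(), key=lambda kv: -kv[1]):
    print(f"{label:22s} {calls[label]:3d} calls {seconds:.6f} s")
```

Keeping one label per call site makes it possible to rank the FATES entry points by accumulated cost.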

jkshuman commented 4 years ago

FWIW @dlawrenncar, with my fire runs on a somewhat old tag (tag1331_api81), with fire disturbance for the tropics only, I get:

1 deg tropics fire: ~530 pe-hrs/yr

dlawrenncar commented 4 years ago

@jkshuman Interesting. Just to be clear, is this a Tropics only run? I guess it must be. What is the domain more precisely?

jkshuman commented 4 years ago

@dlawrenncar yes, tropics only, offline land. Here is a figure for reference, with the coordinates below. Do you want me to kick off another run with a more recent tag?

long = 0 to 360, lat = -55 to 30

[Figure: TLAI_per_yr13_tropicsYr70_Fire_Hybrid_1deg_MoiT_C3_lightning_1x1_a1a8efe5_d5289355]

jkshuman commented 4 years ago

I should add that this is a 3PFT run with fire disturbance, so perhaps not directly comparable?

dlawrenncar commented 4 years ago

I don't think we need another run with a more modern tag. For CSL, I am using global numbers and I have what is needed for the proposal. Next steps on costs need to be as outlined above to get more accurate and full understanding of where the costs are coming from.

billsacks commented 2 years ago

Cross-reference: see the more recent issue https://github.com/NGEET/fates/issues/859

ekluzek commented 2 months ago

Closing as it looks like the thing that was needed was figured out.

wwieder commented 2 months ago

Definition of done is when FATES timing files are added to the cesm3 website for SP and FixedBiogeography simulations.

ekluzek commented 2 months ago

Notes from Jim about how to store the timing results:

I forgot to mention in yesterday's meeting: you may get requests to update the cesm3 timing table at https://cseg.cgd.ucar.edu/timing/timings/. I will handle the fully active compsets, and feel free to forward those requests to me. But I would appreciate it if you would do the component-specific compsets yourself. To do so, run a PFS test with the compset and resolution you want, then upload the resulting timing table to https://cseg.cgd.ucar.edu/timing/upload/ (you will need to log in to do this).

samsrabin commented 1 month ago

Procedure needed for some compsets, including fixed-biogeo—see #2745.