E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Some tests downloading significant input data #5349

Open ndkeen opened 1 year ago

ndkeen commented 1 year ago

Any machine running climate simulations needs a fair amount of space for input data as well as output. On every machine, we have a specified inputdata location where any data needed for a case is downloaded and can then be shared by multiple users. The total amount of data collected in inputdata slowly grows over time -- we rarely delete data that is no longer used. At NERSC, the total space used in /global/cfs/cdirs/e3sm/inputdata is currently 77.8 TB (68 TB in atm). There are situations where it would be beneficial to be more careful about what data is downloaded for a given case. One such situation is trying to maintain testing on machines with limited disk space for this data. For machines that are mostly (or only) used for testing, a minimal set of inputdata would be better (or even required). We currently have this issue on a GCP (Google Cloud Platform) cluster that has a 2 TB disk (for all uses, shared across all users), where disk space is rented at a premium.

For any case, we can change the location of inputdata with xmlchange DIN_LOC_ROOT=/newi, and the new location is then populated with the data required for that case (the data is downloaded via wget from the blues server at ANL). For a few common test suites, I made this change on pm-cpu to see how much data is actually required to run the tests (as currently defined).

e3sm_developer       782.1 GB /pscratch/sd/n/ndk/i-e3sm_developer
e3sm_integration     893.5 GB /pscratch/sd/n/ndk/i-e3sm_integration
e3sm_extra_coverage 1354.9 GB /pscratch/sd/n/ndk/i-e3sm_extra_cov
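
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of how a per-suite total could be computed once DIN_LOC_ROOT points at a fresh directory. The function names `tree_size_bytes` and `format_gb` are made up for illustration, and I am assuming decimal gigabytes (1 GB = 1e9 bytes); the numbers above may have been produced with a different tool or unit.

```python
import os


def tree_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def format_gb(nbytes):
    """Format a byte count in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return f"{nbytes / 1e9:.1f} GB"
```

Calling `format_gb(tree_size_bytes(path))` on each suite's inputdata directory would give one line of a table like the one above.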

Of course, running e3sm_integration and e3sm_extra_coverage with the same inputdata location would save space, since files needed by both would only be downloaded once.
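
To put a number on that overlap, something like the following sketch could compare the cost of separate inputdata directories against one shared directory. It assumes we already have, for each suite, a mapping from relative file path to size in bytes; the function name `shared_savings` and the manifest structure are hypothetical, not something CIME provides.

```python
def shared_savings(manifests):
    """Compare separate vs. shared inputdata totals.

    manifests: dict of suite name -> dict of relative path -> size in bytes
    (hypothetical structure, built e.g. by walking each suite's directory).
    Returns (separate_total, shared_total) in bytes.
    """
    # Each suite keeps its own copy of every file it needs.
    separate = sum(sum(files.values()) for files in manifests.values())

    # One shared directory: a file needed by several suites is stored once.
    merged = {}
    for files in manifests.values():
        merged.update(files)
    shared = sum(merged.values())

    return separate, shared
```

The difference between the two returned values is the space saved by pointing both suites at the same location.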

And there are a few stand-outs, i.e. tests that download a significant portion of the data. They are all cases that use forcing data: while the test may only run for a few days, multiple years of forcing data are downloaded.

In the extra_coverage suite, the following directory is 1087 GB, which is 80% of the total needed.

ocn/jra55/v1.3_noleap

And for the other suites, these forcing data dirs:

atm/datm7/atm_forcing.datm7.GSWP3.0.5d.v1.c170516            396.2 GB
atm/datm7/atm_forcing.datm7.cruncep_qianFill.0.5d.V4.c130305 198.8 GB
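
A rough sketch of how stand-out directories like these could be found: total the file sizes per subdirectory at a fixed depth below the inputdata root and sort. The names `size_by_subdir` and `largest` and the default depth are assumptions chosen for illustration.

```python
import os
from collections import defaultdict


def size_by_subdir(root, depth=2):
    """Return {relative dir truncated to `depth` components: total bytes}."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        # Files sitting directly in root are grouped under ".".
        key = os.sep.join(parts[:depth]) or "."
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                totals[key] += os.path.getsize(path)
    return dict(totals)


def largest(totals, n=5):
    """Return the n largest (directory, bytes) pairs, biggest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

With `depth=3`, a directory such as ocn/jra55/v1.3_noleap would show up as a single entry.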

If it's easy and practical, it would be great if a case could be smarter about what data it will actually need. I think this is good in general, but it could specifically help on machines with limited disk space. For now, I'm going to try freeing up space on GCP by removing inputdata that is no longer used by the current set of test cases, but I wanted to document the issue.

I also have the required data download size per test case. For example, the test ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_gnu.elm-erosion downloads 445 GB alone. Note that during this testing, I found a few cases that are unable to download all the data they need without help. Several cases need data in atm/cam/chem/trop_mozart_aero/emis/DECK_ne30, which is not automatically downloaded. The total size of this dir is only 33 GB. Also, any MPAS test that requires 2 cases (i.e. ERS, PET, etc.) and has a case2run that needs a different PE layout will fail at runtime -- the test case does not realize it may need different MPAS partition files for the case2run.

rljacob commented 1 year ago

https://github.com/E3SM-Project/E3SM/pull/5150 handled this for one case.

ndkeen commented 1 year ago

Specifically for the case of SMS_D_Ld1.TL319_EC30to60E2r2.DTESTM-JRA1p5, with Apr27th master, if I download the data needed to a local dir, there are 178 files totaling 350.4 GB. /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/jjra/SMS_D_Ld1.TL319_EC30to60E2r2.DTESTM-JRA1p5.pm-cpu_gnu.20230427_093602_7hlylu/inputdata

If I use a proposed different test, SMS_D_Ld1.TL319_EC30to60E2r2.DTESTM-JRA1p5.pm-cpu_gnu.mpassi-jra_1958, and download the required data from scratch, there are 119 files totaling 197.2 GB. /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/jjra/SMS_D_Ld1.TL319_EC30to60E2r2.DTESTM-JRA1p5.pm-cpu_gnu.mpassi-jra_1958.20230427_090012_is8zmt/inputdata

This is in reference to https://github.com/E3SM-Project/E3SM/pull/5639

rljacob commented 1 year ago

The fixes in #5639 still allow 63 JRA.v1.5.runoff* files to be downloaded. Each file is 3GB for a total of 192GB just for the runoff files. @jonbob all the data models are basically the same so it should be possible to add changes to drof like you did for datm in #5639.

ndkeen commented 1 year ago

Noting that SMS_D_Ld1.TL319_EC30to60E2r2.DTESTM-JRA1p5.pm-cpu_gnu.mpassi-jra_1958 on current next still downloads a large amount of data (same as above -- 119 files totaling 197.2 GB).

I'm also trying to download all of the data needed for SMS_D_Ln3.TL319_EC30to60E2r2_wQU225EC30to60E2r2.GMPAS-JRA1p5-WW3, which is taking a while and will also be a large amount -- perhaps including some of the same data.

These will not be able to run on machines with limited space, such as GCP.
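
One way to catch this before a download starts is a simple free-space check, sketched below. It uses the standard-library call `shutil.disk_usage`; the function name `fits_on_disk` and the 10% reserve are assumptions for illustration, and nothing like this is claimed to exist in the test scripts today.

```python
import shutil


def fits_on_disk(required_bytes, path, reserve_fraction=0.1):
    """Return True if required_bytes fits in the free space at path.

    reserve_fraction (assumed default 0.1) is the share of total capacity
    to leave unused, since the disk is shared across all users.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    reserve = usage.total * reserve_fraction
    return required_bytes <= usage.free - reserve
```

For example, `fits_on_disk(197.2e9, din_loc_root)` would tell us whether the 197.2 GB test above has any chance of running on a given machine, where `din_loc_root` stands for whatever DIN_LOC_ROOT is set to.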

jonbob commented 1 year ago

@rljacob - yes, it should be possible. Do we want a generic data model setting for start_ and stop_year, or should we just add something to drof like what is in datm?
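
To illustrate the general idea of restricting forcing data to a year range, here is a minimal sketch that keeps only the files whose names contain a year between a start and stop year. It is not how datm or drof select their stream files; the function name `files_for_years` and the assumption that each file name carries a standalone four-digit year are hypothetical.

```python
import re

# A standalone 4-digit number, not part of a longer digit run such as a
# 6-digit creation date (assumption about how years appear in file names).
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def files_for_years(filenames, start_year, stop_year):
    """Keep file names that mention at least one year in [start_year, stop_year]."""
    selected = []
    for name in filenames:
        years = [int(y) for y in YEAR_RE.findall(name)]
        if any(start_year <= y <= stop_year for y in years):
            selected.append(name)
    return selected
```

With one file per year, a short test that only needs a single year of forcing would select one file rather than the whole multi-year set.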

ndkeen commented 1 year ago

PR https://github.com/E3SM-Project/E3SM/pull/5670 just made a change that might have reduced the number of required downloaded files (again). Update: I just realized my test had only the NERSC portal change and did NOT contain the changes in that PR, so I retested. I now only see 57 files downloaded, totaling 22.4 GB.

ndkeen commented 1 year ago

After https://github.com/E3SM-Project/E3SM/pull/5670, there are still (at least) 2 tests that will download a lot of data.

One is SMS_D_Ln3.TL319_EC30to60E2r2_wQU225EC30to60E2r2.GMPAS-JRA1p5-WW3.ww3-jra_2004, which may just need the same changes as the mentioned PR.

And the other is ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.gcp12_gnu.elm-erosion