NCAR / ccpp-scm

CCPP Single Column Model

Running time of ~7 hours versus about 15 minutes #289

Closed gthompsnWRF closed 1 year ago

gthompsnWRF commented 2 years ago

Description

I don't know how this is possible, but when I run ccpp-scm with the ARM-SGP case and GSD suite physics on Cheyenne under my /glade/work directory, it takes over 7 hours to run. Yet when I run the same thing within my /glade/home area, it takes roughly 15 minutes.

Steps to Reproduce

I will now test this again to see whether it is entirely reproducible: I will do a fresh git clone into one directory under Cheyenne's /glade/work and another under /glade/home, then run the same case in both and compare.

climbfuji commented 2 years ago

This seems like a big difference, but it is possible. /glade/home/ and /glade/work/ are separate filesystems optimized for different uses. /glade/work/ is a parallel file system designed for large jobs, large volumes of data, and parallel reads/writes. If you hammer it with hundreds of tiny write requests, you keep the metadata servers busy and degrade performance, possibly not only for yourself but also for others. /glade/home/ is probably tuned for what users usually keep in their home directories: smaller files, but many of them. It probably also sees much less load on average on the filesystem/metadata servers than work or scratch.

climbfuji commented 2 years ago

I would guess that if SCM buffered the data and wrote it out all at once at the end, instead of writing every time step (which I think is what it does now), you wouldn't see such a difference between the two filesystems, and you might also see an overall increase in performance everywhere. @grantfirl correct me if I am wrong.
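To illustrate the buffering idea, here is a minimal Python sketch. It is not SCM code: the function names, the CSV-style records, and the `step * 0.5` placeholder values are all hypothetical. It only contrasts one open/write/close cycle per time step with collecting records in memory and writing them once at the end.

```python
import os
import tempfile


def write_every_step(path, n_steps):
    """Reopen the file and append one small record at every time step."""
    n_requests = 0
    for step in range(n_steps):
        # Hypothetical record; a real model would write its diagnostics here.
        with open(path, "a") as f:
            f.write(f"{step},{step * 0.5}\n")
        n_requests += 1
    return n_requests


def write_once_at_end(path, n_steps):
    """Keep records in memory and write them in a single pass at the end."""
    rows = []
    for step in range(n_steps):
        rows.append(f"{step},{step * 0.5}\n")
    with open(path, "w") as f:
        f.writelines(rows)
    return 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        per_step = os.path.join(tmp, "per_step.csv")
        buffered = os.path.join(tmp, "buffered.csv")
        print(write_every_step(per_step, 1000), "open/write/close cycles")
        print(write_once_at_end(buffered, 1000), "open/write/close cycle")
        with open(per_step) as fa, open(buffered) as fb:
            print("identical output:", fa.read() == fb.read())
```

Both functions produce the same file; only the number of filesystem requests differs. The trade-off is that buffered output is lost if the run crashes before the final write, so flushing every N steps may be a reasonable middle ground.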

gthompsnWRF commented 2 years ago

@climbfuji I can follow up with this. In my testing, a fresh clone of ccpp-scm running 100% out-of-the-box GSD_v1 took about 7 minutes on the /glade/home partition, while exactly the same steps on /glade/work took 18 minutes.

Now I am running with the timestep reduced from 600 seconds to 60 seconds (more realistic!) and the radiation timestep reduced from 3600 seconds to 600 seconds. I am still awaiting results: the /glade/work run is already past 40 minutes of wall clock time, and I just submitted the job on /glade/home. You can see my next issue/enhancement request, since it took me about 30 minutes to scrub my /glade/home area clean enough to permit my next run. While Cheyenne may be a bit unique, a doubling of runtime (or more) is really a bit crazy. And 7 hours compared to 10-15 minutes is just ludicrous. There must be a better way.

gthompsnWRF commented 2 years ago

And another follow-up since it is very clearly consistent now.

With timesteps reduced to 60s (dynamics and microphysics) and 600s (radiation), the entire ARM-SGP case, nearly 27 days of simulation, finished in about 19 minutes when outputting to the /glade/home partition. On the /glade/work partition, the run was still going after 1 hour and 10 minutes, and by 1 hour and 14 minutes it had reached only 390K of 2.5M seconds (about 15% complete). So I will QUIT working on /glade/work, run on /glade/home instead, and copy results to the work directory to avoid filling my quota.
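As a rough sanity check on those numbers, the sketch below linearly extrapolates the /glade/work run from the progress reported above (390K of 2.5M simulated seconds after 1 hour and 14 minutes). It assumes a constant rate of progress, which may not hold on a shared filesystem, and the function name is only illustrative.

```python
def projected_runtime_minutes(elapsed_min, sim_done_s, sim_total_s):
    """Linear extrapolation of total wall clock time from partial progress."""
    return elapsed_min * sim_total_s / sim_done_s


# Figures taken from the comment above.
work_elapsed_min = 74.0    # 1 hour 14 minutes on /glade/work
sim_done_s = 390_000       # simulated seconds reached so far
sim_total_s = 2_500_000    # total simulated seconds
home_total_min = 19.0      # complete run on /glade/home

work_total_min = projected_runtime_minutes(work_elapsed_min, sim_done_s, sim_total_s)
slowdown = work_total_min / home_total_min

if __name__ == "__main__":
    print(f"fraction complete on /glade/work: {sim_done_s / sim_total_s:.1%}")
    print(f"projected /glade/work runtime: {work_total_min:.0f} min "
          f"({work_total_min / 60:.1f} h)")
    print(f"slowdown relative to /glade/home: about {slowdown:.0f}x")
```

Under that assumption the /glade/work run would need roughly 474 minutes (close to 8 hours), about 25 times longer than on /glade/home, which is in line with the 7+ hours reported at the top of this issue.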

Hopefully this helps with a redesign of output strategy.

climbfuji commented 2 years ago

@gthompsnWRF I transferred this issue from ccpp-physics to ccpp-scm.

dustinswales commented 1 year ago

Seems the problem has been identified. Moving from Issue to Discussions for posterity.