NOAA-CEFI-Regional-Ocean-Modeling / ocean_BGC


wall time on gaea #81

Closed eric-mortenson closed 2 months ago

eric-mortenson commented 3 months ago

I tried running the default NWA12 configuration XML, CEFI_NWA12_cobalt.xml, but the simulation timed out. I was able to fix the issue by changing the Wallclock and segRuntime from 13 to 16 hours.

yichengt900 commented 3 months ago

@eric-mortenson, thanks for reporting the timing issue. We are also currently investigating the runtime for our NWA and NEP domains on Gaea. We will keep you updated.

By the way, the NWA XML and runtime issues relate more to the CEFI-regional-MOM6 repo (https://github.com/NOAA-GFDL/CEFI-regional-MOM6), while ocean_BGC covers the ocean biogeochemistry code itself, so you may want to open issues like this one there in the future.

eric-mortenson commented 3 months ago

Hi Yi-Cheng,

I am getting confused now. As I mentioned, I ran the NWA12 experiment for 1 model year with a 16-hour wall time and it completed successfully. Before that, I also ran it for 5 model years with a 16-hour wall time, which was successful too. But when I ran it for 27 model years, like the default case, with a 16-hour wall time, it failed by timing out in the first year. I don't think I did anything wrong: I've rechecked the frerun command I used, and it looks correct. The only difference is that I originally used the overwrite option (-o) and in this case I used -e, but I don't think that would cause the wall time issue. The commands I used are pasted below:

frerun -e -x CEFI_NWA12_cobalt_v1.xml -p ncrc5.intel22 -t prod CEFI_NWA12_COBALT_V1

sbatch /gpfs/f5/cefi/scratch/Eric.Mortenson/fre/cefi/NWA/2024_06/CEFI_NWA12_COBALT_V1/ncrc5.intel22-prod/scripts/run/CEFI_NWA12_COBALT_V1

For the second point, sorry, I thought an issue raised from anywhere within the wiki/GitHub went to the same place. Next time I will move to the appropriate repository before raising an issue so it's clearer.

Cheers, Eric
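One way to confirm that a segment really ran into its time limit, rather than failing for another reason, is to compare the job's elapsed time with its limit. The following is only a minimal sketch: `to_seconds` and `hit_time_limit` are hypothetical helpers (not part of FRE or Slurm), and the durations passed in at the end are made-up values. On Gaea the real numbers would presumably come from Slurm accounting, for example `sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit`, where a timed-out job is normally reported with state TIMEOUT.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of FRE or Slurm.

# Convert a Slurm-style duration (HH:MM:SS or D-HH:MM:SS) to seconds.
to_seconds() {
  local t=$1 days=0 h m s
  if [[ $t == *-* ]]; then
    days=${t%%-*}   # part before the dash is the day count
    t=${t#*-}       # remainder is HH:MM:SS
  fi
  IFS=: read -r h m s <<< "$t"
  # The 10# prefix forces base 10 so values like "08" are not parsed as octal.
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print "timeout" if the elapsed time reached the limit, otherwise "ok".
hit_time_limit() {
  local elapsed limit
  elapsed=$(to_seconds "$1")
  limit=$(to_seconds "$2")
  if (( elapsed >= limit )); then
    echo "timeout"
  else
    echo "ok"
  fi
}

# Made-up example values: a job that ran slightly past a 16-hour limit.
hit_time_limit 16:00:12 16:00:00
```

With these example values the script prints `timeout`; a job that finished in, say, `07:30:00` against the same limit would print `ok`.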


yichengt900 commented 3 months ago

Hi @eric-mortenson,

No worries. Your run script and the frerun command you used look good to me. My first guess is that the Gaea F5 filesystem may have some I/O issues, significantly slowing down your model run. You can always resubmit the run if it crashes due to a timeout. Unfortunately, we are using a 1-year segment interval, so it will always restart from day 1 of the crashing year.

By the way, I have a PR that can significantly reduce our NWA runtime (7.5 hours for a 1-year simulation) while maintaining good model results. I recommend trying that configuration.
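The segment arithmetic in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not anything taken from FRE: `segment_plan` is a hypothetical helper, and it assumes the 16-hour limit mentioned earlier in the thread is kept when using the faster configuration. Times are given in minutes so that the 7.5-hour figure works with integer shell arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not part of FRE.
# Arguments: total model years, model years per segment,
#            wall-clock limit per segment (minutes),
#            expected runtime per segment (minutes).
segment_plan() {
  local total_years=$1 seg_years=$2 limit_min=$3 runtime_min=$4
  # Round up so a partial final segment still counts as one job.
  local segments=$(( (total_years + seg_years - 1) / seg_years ))
  # Spare time per segment before the wall-clock limit is reached.
  local margin=$(( limit_min - runtime_min ))
  echo "segments=${segments} margin_min=${margin}"
}

# 27 model years in 1-year segments, 16 h limit (960 min),
# 7.5 h expected runtime per segment (450 min).
segment_plan 27 1 960 450
```

For these inputs the script prints `segments=27 margin_min=510`, meaning 27 one-year segments, each with about 8.5 hours of headroom. Because each segment is one model year, a timeout costs at most the year that was in progress, and the wall-clock limit has to cover the slowest single year rather than the whole 27-year run.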