E3SM-Project / ACME-ECP

E3SM MMF for DoE ECP project

Fail to run ne30pg2_ne30pg2 with ntasks = 13600 #104

Open guangxinglin opened 5 years ago

guangxinglin commented 5 years ago

With the new master FV physics for CRM, I created a case with a resolution of ne30pg2_ne30pg2 and ntasks = 13600. Then I submitted a job on Cori, but the job failed with the error message below.

decompInit_lnd(): Number of processes exceeds number of land grid cells 13600 7425 ENDRUN: ERROR in decompInitMod.F90 at line 175

This error did not show up for my previous runs with a resolution of ne30_ne30 using the old master without FV physics.

Is this error expected for the FV physics? To fix it, do I have to reduce the ntasks below 7425?

worleyph commented 5 years ago

You might try reducing just the number of ntasks for LND (and leaving the number of ATM ntasks as is). Someone else will need to respond as to whether this is expected or not.
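
For example, from the case directory (a sketch assuming the standard CIME PE-layout variables; the value is only illustrative):

```
./xmlchange NTASKS_LND=5400   # reduce only the land task count; leave NTASKS_ATM unchanged
./case.setup --reset          # regenerate the PE layout after changing task counts
```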

lee1046 commented 5 years ago

Guangxing, did you make the new land initial conditions for the pg2 grid?

whannah1 commented 5 years ago

Guangxing, I think there are two potential problems here. One is that you shouldn't alter the task count for land on Cori, unless this is for MAML.

The other problem is that if you're using more tasks than the number of elements on the dynamics grid, you'll need to set the namelist variable "dyn_npes" to the number of elements. In the case of ne30 or ne30pg2 this is 5400. For pg2 grids the number of physics cells is equal to (# elements x 4), so you also shouldn't exceed that, which obviously isn't the problem here.

But this has me wondering why you are using such a high task count. Is this for MAML? I don't think you want to be running the land at 1 point per task; that's too many tasks and it's not as efficient. I normally use 5400 for MMF runs on CPU machines.
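
As a concrete sketch (assuming the standard case-directory workflow and that atmosphere namelist mods go in user_nl_cam on this branch):

```
# ne30 / ne30pg2: 6 * 30 * 30 = 5400 spectral elements, so cap dyn_npes at 5400.
# ne30pg2 physics columns: 5400 * 4 = 21600, the upper bound for physics tasks.
echo "dyn_npes = 5400" >> user_nl_cam
./preview_namelists    # confirm the value shows up in the generated atm_in
```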

AaronDonahue commented 5 years ago

Just a quick response to what @whannah1 wrote above. Theoretically, if you run the model with more cores than dynamics elements (i.e. >5400 on ne30), the model should automatically set dyn_npes to 5400 so your run can go on. If you find that it doesn't do this, it could mean this check is broken and we should probably fix it. Alternatively, I've found that if you manually set dyn_npes to an unreasonable number via the namelist, the check is skipped and the run can break; in that case you would want to remove any dyn_npes values in the namelist or set it to 5400 (or equivalent).
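
A quick way to check for and clear a stray override, under the same assumptions as above (user_nl_cam as the atmosphere namelist mods file):

```
grep -n dyn_npes user_nl_cam       # see whether a manual value is set
sed -i '/dyn_npes/d' user_nl_cam   # drop it (or edit it to 5400 instead)
./preview_namelists
grep dyn_npes CaseDocs/atm_in      # verify the value the model will actually use
```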

guangxinglin commented 5 years ago

Jungmin, I haven't yet. I'm just trying to make the model run and test MAML for the pg2 grid at this stage. Do you have the new land initial conditions for the pg2 grid? If so, can I use them? Thanks.

whannah1 commented 5 years ago

We don't have the land initial condition for pg2 yet, but you should be able to run without it.

guangxinglin commented 5 years ago

Walter, yes. I am testing it for MAML, which needs more tasks than the standard MMF because you have to divide the ntasks by the number of land model instances you are using. I am still not clear why this is not a problem for ne30_ne30 but is an issue for ne30pg2. Thanks.
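
Purely to illustrate the arithmetic (a sketch assuming MAML maps onto CIME's standard multi-instance variables, which may not be exactly how it is wired up):

```
# Hypothetical layout: 16 land instances sharing the land task pool,
# i.e. 13600 / 16 = 850 tasks per instance.
./xmlchange NINST_LND=16
./xmlchange NTASKS_LND=13600
```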

guangxinglin commented 5 years ago

Yes. I will reduce the number of ntasks for LND and try it. Thanks.

guangxinglin commented 5 years ago

Yes, I think you are right. I just checked dyn_npes in the ne30_ne30 and ne30pg2_ne30pg2 runs. I find that dyn_npes is automatically set to 5400 for ne30_ne30, but that is not the case for ne30pg2_ne30pg2. I think that is the reason for the error I got. Thanks.
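
For anyone repeating the check, it is just a grep of the generated namelist in each run directory (the paths below are placeholders):

```
grep -i dyn_npes /path/to/ne30_case/run/atm_in      # placeholder path
grep -i dyn_npes /path/to/ne30pg2_case/run/atm_in   # placeholder path
```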

whannah1 commented 5 years ago

The part of the code that catches this must not know about the ensemble mode, so you're stuck just using the number of land points. BTW, we can use more land points if we want with the "tri-grid" approach, but personally I think it makes sense to keep atmos and land components on the same grid even if it makes it harder to test new grids.

AaronDonahue commented 5 years ago

Is this something we can fix for ensemble mode? I recall that when I reviewed the FV PR I tested the dyn_npes automatic fix and it worked.

whannah1 commented 5 years ago

It looks like this check is coming from the land component, ./components/clm/src/main/decompInitMod.F90, so we'd need to change the check there. I'm not sure how to make it aware of the ensemble setup, but I don't imagine it would be that hard to query at run time.
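
To find the exact spot from the repository root, grepping for the message string from the error at the top of this issue should work (assuming the message is a single string in the source):

```
grep -n "exceeds number of land grid cells" components/clm/src/main/decompInitMod.F90
```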

ndkeen commented 5 years ago

Kinda have to laugh a little bit here. This is something we hit ~2 years ago, and more than once. I got the impression that the LND team feels like the solution is "just don't use too many processes for LND", but I think the code should simply use the max. So what I do is always limit the number of LND processes manually in my larger runs. It's not just the number of MPI ranks, it's MPI ranks x threads, so just setting LND threads to 1 can sometimes help immediately. Here is an old issue: https://github.com/E3SM-Project/E3SM/issues/1952
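
Concretely, something like this from the case directory (illustrative values; the point is to cap land MPI ranks x threads):

```
./xmlchange NTHRDS_LND=1      # land threads to 1, per the suggestion above
./xmlchange NTASKS_LND=5400   # keep land MPI ranks at or below the grid-cell count
./case.setup --reset
```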

whannah1 commented 5 years ago

@ndkeen, in the case of MAML we'll be using somewhere between 16-64 instances of the land model, so being able to use 4-16x the max number of processors for the land might be a big help. Although compared to the cost of the MMF, maybe it doesn't matter anyway.