Closed by AndyHoggANU 1 year ago
Did this ever get fixed? Note that @micaeljtoliveira's scaling indicates that perhaps we should increase the panan-01 to ~2000 cores also.
A layout of 72x18 uses 1008 cores, which is divisible by 48. A quick test shows that the time-stepping is roughly 6% faster than with the original layout, for exactly the same SU cost. If a faster time-to-solution is needed, then using 1942 cores would be a good alternative.
@adele157 Shall we update the configuration to use 1008 cores?
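For anyone comparing candidate core counts, here is a quick sketch of the node-slot arithmetic, assuming 48-core nodes as in this thread; the 962 and 1008 figures are the ones discussed above, and the helper name is ours:

```python
# Node-slot accounting on 48-core nodes: how many core slots sit idle
# for a given MPI rank count? 962 (old panan-01 layout) and 1008
# (72x18 with masking) are the counts from this thread.
import math

CORES_PER_NODE = 48

def node_waste(ranks, cores_per_node=CORES_PER_NODE):
    """Return (nodes used, idle cores, fraction of slots idle)."""
    nodes = math.ceil(ranks / cores_per_node)
    idle = nodes * cores_per_node - ranks
    return nodes, idle, idle / (nodes * cores_per_node)

for ranks in (962, 1008):
    nodes, idle, frac = node_waste(ranks)
    print(f"{ranks} ranks -> {nodes} nodes, {idle} idle cores ({frac:.1%})")
```

Both counts land on 21 nodes, but 962 ranks leaves 46 slots (about 4.6%) idle while 1008 fills every slot, which is why the two layouts cost the same SUs.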
Yes, 6% faster sounds good!
While we’re at it, how about changing to 1821 cores as you suggest in #20 ?
Also, while you're at it ... do you think you could recommend an optimal core count/layout for the global-01 config? We are currently on 80*75=6000 cores with 1585 masked - so 4415 cores in total ... but your scaling tests imply we could probably go higher?
@adele157 Sure! Where should these changes go? Shall I just commit them to both the panan-01-zstar-run and panan-01-hycom1-run branches? I would also need to put the mask somewhere sensible, ideally with the other ones (`/g/data/x77/ahg157/inputs/mom6/panan-01/masks/`), but I don't have write permissions to that directory.
@AndyHoggANU I'm now doing some tests for this configuration, but when I try to increase the number of cores to 6571 I get the following error at initialization:
FATAL from PE 3519: MPP_DEFINE_DOMAINS(mpp_compute_extent): domain extents must be positive definite.
I need to understand what is going on.
Yes, committing to both those 01 branches seems like the way to go to me. Any objections @angus-g?
With all these changes you're making to the panan setups (and probably OM3 soon), maybe it's easiest for you to apply for access to project ik11_w. Is that ok @aekiss?
I feel like I've seen this depending on the "halo" argument that you give to `check_mask`, but I haven't looked any further into where it really comes from.
All good from here!
@micaeljtoliveira - sorry, I haven't seen that `MPP_DEFINE_DOMAINS` error in any of my runs, so not sure where it comes from. @angus-g, are you able to look into this a little more?
@angus-g Thanks for the tip! That indeed solves the problem. Does this mean that MOM (or SIS) need more than one halo point in each direction? If that's the case, it's a bit puzzling that this problem does not always show up.
I do have access to ik11_w. The issue here is that the directory in question is not in /g/data/ik11 and it has no write permissions for the group. @angus-g Would you mind fixing this?
MOM6 generally needs a halo of 4 for its actual computation, but I think the domain extent error is at a lower level than this. I've come across that error when generating the masks with a few different halo sizes! Seems to be some interaction with the tile size and the masking it results in.
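To illustrate the failure mode, here is a toy even 1-D decomposition, not the real `mpp_compute_extent` algorithm: when a dimension ends up with fewer effective grid points than ranks along it, some ranks get zero-width extents, which is what the "positive definite" FATAL guards against.

```python
# Toy 1-D block decomposition: split npts grid points across ndivs
# ranks as evenly as possible, and check whether every rank receives
# a positive extent. FMS aborts with "domain extents must be positive
# definite" when any rank would get zero (or fewer) points.
def compute_extents(npts, ndivs):
    base, extra = divmod(npts, ndivs)
    # the first `extra` ranks get one extra point each
    return [base + (1 if i < extra else 0) for i in range(ndivs)]

def all_positive(npts, ndivs):
    return all(e > 0 for e in compute_extents(npts, ndivs))

# 80 points over 100 ranks: 20 ranks end up with zero-width domains
print(all_positive(80, 100))  # some extents are zero -> FATAL territory
print(all_positive(80, 40))   # fine: every rank gets 2 points
```

This is why the error appears only past certain layout sizes, and why the halo choice fed to mask generation can tip a layout over the edge.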
I think maybe you should create an input directory in ik11 with just what's needed, and then maybe we can point to that as the canonical source, rather than my one?
Sounds like a good idea. I'll do that.
Nice - glad you solved it!
@angus-g There's already a directory in ik11 (/g/data/ik11/inputs/mom6/panan-01). This is only used in the master branch, which if I understand correctly, is not actually being used. Shall we update the contents of the directory or do I create a new one? We already have these input files spread over many directories, so it might be better not to increase the entropy any further.
@AndyHoggANU Here is a plot of the parallel efficiency for the global-01 configuration.
It looks to me like the parallel efficiency at the current core count (4415) is already not that great relative to using ~1000 cores (~85% for the main loop), so I wouldn't use a higher core count unless you are in a hurry to get the runs done or you have lots of SUs to spare.
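For reference, the standard strong-scaling efficiency formula behind numbers like the ~85% above; the timings in the example are hypothetical placeholders, not measurements from these runs:

```python
# Strong-scaling parallel efficiency relative to a reference run:
#   E(N) = (T_ref * N_ref) / (T(N) * N)
# i.e. the fraction of the ideal speed-up actually achieved.
def parallel_efficiency(n_ref, t_ref, n, t):
    """Efficiency of an n-core run timed at t, against an n_ref-core
    reference timed at t_ref. 1.0 means perfect scaling."""
    return (t_ref * n_ref) / (t * n)

# Hypothetical numbers only: doubling cores and exactly halving the
# walltime gives efficiency 1.0 (perfect scaling).
print(parallel_efficiency(1000, 100.0, 2000, 50.0))
```

At 85% efficiency, roughly 15% of the extra SUs are buying no extra throughput, which is the trade-off being weighed here.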
I've now pushed the changes to panan-01-zstar-run and panan-01-hycom1-run. Let me know if there's any issue.
Closing this issue, as this has been solved in ccf54a2 and 079e632.
EDIT: It was a mistake I made with NIGLOBAL/NJGLOBAL in MOM_input - Works now!
Hi @micaeljtoliveira, I've received this same `MPP_DEFINE_DOMAINS` positive definite error when trying to increase the processors > 5000 on Setonix. I've tried the `--halo 4` flag as well, but I'm still getting the same error. Do you remember what solved the problem here for you?
I wish I knew more about choosing the right mask. I usually just generate a bunch within a range, e.g. `--min_pe 5000 --max_pe 5050`, and choose one that looks relatively squareish and a multiple of 8, but there's not much thought put into this...
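That informal heuristic can be written down as a tiny filter; the candidate layouts below are invented for illustration, and real ones would come from the mask-generation output:

```python
# Sketch of the "squareish and a multiple of 8" heuristic for picking
# a processor layout from a set of candidate (nx, ny) pairs. The
# candidates here are made up; in practice they'd come from the
# layouts produced during mask generation.
def score_layout(nx, ny):
    """Aspect ratio of the layout; lower is better, 1.0 == square."""
    return max(nx, ny) / min(nx, ny)

def pick_layout(candidates):
    """Keep layouts whose rank count is a multiple of 8, then take
    the most square one. Returns None if nothing qualifies."""
    ok = [(nx, ny) for nx, ny in candidates if (nx * ny) % 8 == 0]
    return min(ok, key=lambda c: score_layout(*c)) if ok else None

candidates = [(100, 50), (72, 70), (250, 20), (80, 63)]
print(pick_layout(candidates))  # (72, 70): nearly square, 5040 % 8 == 0
```

A real selection would also weigh the masked-out rank count and how the total divides into full nodes, as discussed earlier in this thread.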
The core layout for panan-01 has 962 cores ... note that 960 is divisible by 48, so our 21st node is almost unused. We are losing 5% off the bat here. Any chance we can figure out a config that is closer to full node use?