NCAR / ccpp-framework

Common Community Physics Package (CCPP)
http://www.dtcenter.org/community-code/common-community-physics-package-ccpp/
Other
26 stars 63 forks source link

Support multiple concurrent instances per MPI task #463

Closed michalakes closed 1 year ago

michalakes commented 1 year ago

Short description

Support multiple concurrent instances per MPI task closes #469

Changes to be committed:

modified:   scripts/mkstatic.py
modified:   src/ccpp_types.F90
modified:   src/ccpp_types.meta

Description

In order to have an ensemble of model instances in a JEDI executable, we need for packages the model relies on to be “stateless” – meaning that the data for an instances are not shared between instances. A more comprehensive solution would be to implement physics so that all such data is either passed in via arguments to the package, or allocated as an instance of a class defining the physics.

Deferring the larger issue of revisiting CCPP requirements to cover support for concurrent instances, this PR takes a more modest approach. It adds a new "instance" dimension to the types already defined for GFS physics (GFS_control, GFS_data, GFS_interstitial). And it includes some changes that were needed to address first-time initialization latches inside certain packages.

One of these, in Thompson MP, declears the heap allocated fields by ccpp instance. The issue here was that only the first instance of Thompson MP was initializing itself. The other guards against trying to allocate data that has already been allocated (e.g. h2ointerp and ozinterp).

The extent of packages touched by the PR is only one NRL-used suite that is a subset of all the packages in CCPP.

User interface changes?: No

Fixes:

Allow multiple instances of CCPP physics in single executable for ensemble DA #999

https://github.com/NCAR/ccpp-physics/issues/999

Testing:

Has been successfully tested in NRL NEPTUNE code, which includes a twoinsts-physics in its CI test harness.

test removed: unit tests: system tests: manual testing:

gold2718 commented 1 year ago

I have been thinking about this PR and have a few concerns (so I'm withdrawing my approval for now). First, it would really help to have a completed PR desription (the template text is still there at the moment). Second, please reference the appropriate issue number in the PR description (i.e., why are you making this change). In this case, there is at least one issue (NCAR/ccpp-physics#999), are there others?

It seems that the real use case is that you want to run multiple instances of a model in the same set of tasks (multiple overlapping instances), is that correct?

Can someone point me to the section of the current CCPP documentation (https://ccpp-techdoc.readthedocs.io/en/v6.0.0/) that covers rules for the CCPP framework? I do not see anything and in particular am having trouble finding anything similar to the "driver" rules section of the historical rules document (D1 -- D35 in https://dtcenter.org/sites/default/files/community-code/ccpp-requirements-historical.pdf). Still, looking at both documents, I do not see any requirement that states or implies that multiple overlapping instances of a model needs to be supported by the CCPP.

I think this is important because supporting multiple overlapping instances puts requirements on the CCPP Framework as well as on CCPP Physics Parameterizations and a clear requirement will guide both framework and parameterization development. As pointed out in the issue, there are existing physics parameterizations that contain module variables (e.g., h2ointerp) that might contain model state. Please correct me if I am wrong but if a parameterization holds model state (e.g., an evolving cloud PDF) as a module variable, then multiple instances would all be trying to use the same data which seems bound to fail (even without threading). @michalakes sort of pointed this out back in March when he said:

It's a problem if instances need different ozone or h2o data

Can the CCPP governance committee please issue a statement on this requirement and get it into documentation somewhere? If it is decided that supporting multiple overlapping instances of a physics suite is a CCPP requirement, can someone please add a "capgen unification" issue for it?

climbfuji commented 1 year ago

@michalakes I remember having a second new member of the ccpp_types data structure to capture the maximum number of instances (default one). This could be set during the ccpp (framework) init phase and then used to tell the physics to allocate the heap data to the correct size?

michalakes commented 1 year ago

I have been thinking about this PR and have a few concerns (so I'm withdrawing my approval for now). First, it would really help to have a completed PR description (the template text is still there at the moment). Second, please reference the appropriate issue number in the PR description (i.e., why are you making this change). In this case, there is at least one issue (NCAR/ccpp-physics#999), are there others?

Done

It seems that the real use case is that you want to run multiple instances of a model in the same set of tasks (multiple overlapping instances), is that correct?

Yes, we want multiple concurrent instances of a model, and thus multiple concurrent instances of CCPP and other libraries it uses, to work correctly when running on the same process, that is, MPI task.

gold2718 commented 1 year ago

@michalakes,

I respect your process

and

Thanks for being open to considering this further.

The reason the first step in our process is Open an Issue is so that discussions like this can happen well before the PR. I'm entering a discussion below since there is still no issue for this feature.

As for capgen, I would treat this by adding a new dimension to module-level variables. I could allocate enough space to hold the data for each instance and select the current instance by using the ccpp_instance variable (which would probably be passed in from the host like the thread number but that is flexible in capgen). However, to do this, I would need a new CCPP variable, call it ccpp_number_of_instances (similar to number_of_openmp_threads) which would be available at initialization time to be able to allocate the correct number of instance slots. In this case, we would just add a switch to capgen (e.g., --multi-instance) that would trigger this behavior. At that point, the number of instances is set by the host model at run time.

Thoughts on this approach? Is this an approach that could also be used in mkstatic to avoid the hard-coding issue (which seems to be causing at least some discomfort aside from the memory issue)?

And a side point, y'all must have a lot of memory to be able to run multiple instances in a single task (but maybe that's just the mental scar tissue left over from the Blue Gene era talking :)

climbfuji commented 1 year ago

Sorry for my late reply, I was on leave for the last 2+ weeks.

I think it's worth noting that @michalakes and @areinecke joined the CCPP developer meeting a few months ago to explain exactly what the problem is and how this PR addresses it. So I'd say they did what we expected them to do. Unfortunately @gold2718 wasn't on that call and therefore didn't get all the information. I get it that issues need to be created, but at the same time the developer meeting is the main system for discussing/proposing changes to the framework and if someone makes the effort to show up there that should be good enough.

Regarding the proposed solution for capgen above (https://github.com/NCAR/ccpp-framework/pull/463#issuecomment-1594278661), this is what I would have done for ccpp_prebuild as well. But, given that we'll be working actively on the unification and that capgen will replace prebuild in the near future, I thought it would be ok to hardcode it for now in prebuild.

gold2718 commented 1 year ago

I get it that issues need to be created, but at the same time the developer meeting is the main system for discussing/proposing changes to the framework and if someone makes the effort to show up there that should be good enough.

Sure but there is still a community that will not find out what is going on this way (heck, even I didn't). I took a minute and closed that gap (mostly referencing the really good description at NCAR/ccpp-physics#999).

michalakes commented 1 year ago

Thoughts on this approach? Is this an approach that could also be used in mkstatic to avoid the hard-coding issue (which seems to be causing at least some discomfort aside from the memory issue)?

I agree with the two reviewers who've approved the pull request and think the PR as it currently stands is good enough to go in as-is, with a more comprehensive approach to follow if needed. I'm a dilatant when it comes to the ins and outs of CCPP's infrastructure. I managed to learn just enough to incorporate and test the changes to get our project rolling again and to provide back to you in the form of the current PR. So beyond what I've already said, I don't have any additional thoughts on your proposal other than to encourage you all to go forward expeditiously in whatever manner satisfies your requirements and processes.

And a side point, y'all must have a lot of memory to be able to run multiple instances in a single task

The NEPTUNE code scales well both in memory and computationally. Dividing the job over a separate sets of tasks for each ensemble member consumes about the same total amount of memory as spreading the ensembles over all the tasks. There are good scaling efficiency arguments for having each ensemble member on its own set of tasks; but there are also advantages for computing the ensemble statistics having all the members available on each task. NEPTUNE supports either mode of operation, notwithstanding this issue with the physics.

gold2718 commented 1 year ago

I agree with the two reviewers who've approved the pull request and think the PR as it currently stands is good enough to go in as-is, with a more comprehensive approach to follow if needed.

So do we have a quorum? @mkavulich, @grantfirl, are there remaining objections to the short-term plan?

gold2718 commented 1 year ago

Dividing the job over a separate sets of tasks for each ensemble member consumes about the same total amount of memory as spreading the ensembles over all the tasks.

CAM scales as well but CAM-physics probably uses a lot more memory per column. Still, on runs where we run a few dozen columns per task, I suppose we could run a small ensemble instead (and use more cores). Interesting approach!

Thanks for bearing with me on the discussion (where I started way behind).

mkavulich commented 1 year ago

If I recall correctly it was mainly @climbfuji and @gold2718 who advocated for a less hard-coded definition here, so if both of you approve of the current version then I also approve. I'll double-check with @grantfirl to confirm before merging.

climbfuji commented 1 year ago

If I recall correctly it was mainly @climbfuji and @gold2718 who advocated for a less hard-coded definition here, so if both of you approve of the current version then I also approve. I'll double-check with @grantfirl to confirm before merging.

I approved it on Apr 29 with a note that in the future we may want to make it configurable (https://github.com/NCAR/ccpp-framework/pull/463#pullrequestreview-1406980493)

mkavulich commented 1 year ago

Sorry, I am having a bit of trouble following the conclusion here. We decided a couple meetings ago that I should try to pull out the number of instances into a more configurable location, and the PR I opened to John's branch (https://github.com/michalakes/ccpp-framework/pull/1) is the best effort I could come up with without making invasive changes to the CCPP prebuild. Is the decision to go forward without those modifications?

climbfuji commented 1 year ago

Sorry, I am having a bit of trouble following the conclusion here. We decided a couple meetings ago that I should try to pull out the number of instances into a more configurable location, and the PR I opened to John's branch (michalakes#1) is the best effort I could come up with without making invasive changes to the CCPP prebuild. Is the decision to go forward without those modifications?

I looked at your PR and it's totally fine using this. The back and forth here was about the nature of the change in general, i.e. creating multiple instances for members in one MPI task, and how to add this to capgen. I like your PR better than what is this PR now, but in the end it doesn't matter so much since capgen will use something different anyway.

gold2718 commented 1 year ago

Is the decision to go forward without those modifications?

Is there any reason why @michalakes can't just accept your PR and then you can merge to main?

michalakes commented 1 year ago

Is the decision to go forward without those modifications?

Is there any reason why @michalakes can't just accept your PR and then you can merge to main?

Is there a PR I need to accept? Sorry, I don't see it.

gold2718 commented 1 year ago

Is there a PR I need to accept? Sorry, I don't see it.

If I go to your fork and look at Pull Requests, I see this: https://github.com/michalakes/ccpp-framework/pull/1

Note this is a PR to your branch so it shows up on your fork, not on the NCAR organization page.

michalakes commented 1 year ago

If I go to your fork and look at Pull Requests, I see this: michalakes#1

Note this is a PR to your branch so it shows up on your fork, not on the NCAR organization page.

Got it, thanks. All merged. Thanks (That is, I have merged Mike's PR into my fork and it's ready to be merged to main, all-willing)

mkavulich commented 1 year ago

Thanks everyone for the conversation and getting this PR wrapped up. I will go ahead and merge now.

michalakes commented 1 year ago

Thanks, everyone!

From: Michael Kavulich @.> Sent: Tuesday, June 20, 2023 9:57 AM To: NCAR/ccpp-framework @.> Cc: John Michalakes @.>; Mention @.> Subject: Re: [NCAR/ccpp-framework] Support multiple concurrent instances per MPI task (PR #463)

Thanks everyone for the conversation and getting this PR wrapped up. I will go ahead and merge now.

— Reply to this email directly, view it on GitHub https://github.com/NCAR/ccpp-framework/pull/463#issuecomment-1599074754 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5QAL2SKOEVT6USM23ALTLXMHB5RANCNFSM6AAAAAAVLKP47Y . You are receiving this because you were mentioned. https://github.com/notifications/beacon/AB5QAL757ANCRYN5JEKWGZDXMHB5RA5CNFSM6AAAAAAVLKP472WGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTS7J7Y4E.gif Message ID: @. @.> >