Open rljacob opened 8 years ago
Tagging @amametjanov , @wlin7 , @yjinho , @vlarson
At least most of the production machines (e.g., Edison, Mira, Titan) have been using threading for compsets that use CLUBB.
Anyway that we can determine who added that comment (and then, why)?
@worleyph:
Here is the detective work:
Looks like it came from a cime merge: https://github.com/ACME-Climate/ACME/blame/master/cime/machines-acme/config_pes.xml#L1147
Which in cime is: https://github.com/ACME-Climate/cime/commit/713acff
Which looks like it came from a copy of the machines file: https://github.com/ACME-Climate/cime/blob/713acff650adc79d9e60bfdbaf4eaef95d54444e/machines-acme/config_pes.xml#L1574
Which looks like it came from Jim Edwards (during the initial setup of cime) https://github.com/ACME-Climate/cime/blame/713acff650adc79d9e60bfdbaf4eaef95d54444e/machines/config_pes.xml#L1574
So, maybe if we ask the NCAR CIME folks, they might know if it's broken and if so why.
Thanks @douglasjacobsen . @rljacob , looks like this is on your todo list? Unless @douglasjacobsen wants to ask.
Just a heads up, I don't see this comment in CESM's version of config_pes.xml for CAM. Not sure if that means it works for them or just they don't know if it works in that version.
Well, if this came in with the original CIME, it is pretty old, and we have debugged CLUBB since then, so it is probably just an out-of-date comment. Still worth asking to make sure.
I asked Jim Edwards, and he said the particular issue that lead to that comment was fixed in a CAM tag about 4-6 months ago. He said he thinks the issue was a race condition.
He also said he thinks that threading working in CLUBB depends on the CLUBB version, and that it's a work in progress. So, I don't know what that means for the version that ACME has.
This is a question for the atmosphere group then - they need to be talking to the CLUBB developers about what is expected to work and what isn't. I haven't seen any problems with CLUBB in my own runs recently, threaded or not. It would be nice to understand the error signature. @philrasch , is this something that you could assign to someone to track down?
I've been running the case -mach edison -compset FC5AV1C-01 -res ne30_oEC
, which includes -clubb_sgs
in CAM_CONFIG_OPTS, and have been getting BFB results.
@singhbalwinder, could you check if this multi-threading test works for you
If it runs okay, the PE block can probably be removed?
Update: the PET_PT_Ln9.ne30_ne30.FC5AV1C-L.edison_intel
test passed:
So the comment appears to be out-of-date.
Recent CAM bugfix suggests that CLUBB is thread safe, EXCEPT when clubb_history is enabled:
"Reintroduced the check in build-namelist to prohibit running with more than one thread if clubb_history is enabled. This feature does not run properly when more than when thread is used."
Should I verify if this is the case? I can run a PET_PT_Ln9.ne30_ne30.FC5AV1C-L case with clubb_history on.
I thought we should at least document this issue and keep it open for now. Not sure if it is worth investigating right now, unless you often run with clubb_history enabled.
We could backport this NCAR change, to abort in case someone tries to use this with threading. Seems like a simple way of insuring that we don't forget :-).
In a recent CAM tag, NCAR did the following for CLUBB: -- Reintroduced the check in build-namelist to prohibit running with more than one thread if clubb_history is enabled. This feature does not run properly when more than when thread is used. -- Removed the ability to turn on clubb_do_deep as it was an unsupported feature
Should I incorporate this change into ACME? I think this will address the concerns raised in this issue.
Is clubb_history not something we want "on" for all production runs?
I am not sure about that. @wlin7 , would you please respond to Rob's question?
Hi @rljacob , @singhbalwinder , it is for recording useful statistics for the clubb sub-model if wanted to dig deeper into how clubb does the works. But no, it is not something we want 'on' for typical production runs.
Thanks @wlin7 ! In that case, I think the NCAR fix can serve our needs.
This comment appears in config_pes.xml around line 1147: (noticed by @polunma )
A_WCYCL cases use CLUBB but we've done nothing to ensure those runs are single threaded.
What is the origin of this comment? Is it still true?