E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
352 stars 365 forks source link

Is threading broken for CLUBB? #819

Open rljacob opened 8 years ago

rljacob commented 8 years ago

This comment appears in config_pes.xml around line 1147: (noticed by @polunma )

<!-- Threading is broken for CLUBB.  Only use 1 thread with CLUBB
     compsets.  -->  
<pes CCSM_LCOMPSET="CLB">
    <NTHRDS_ATM>1</NTHRDS_ATM>
    <NTHRDS_LND>1</NTHRDS_LND>
    <NTHRDS_ROF>1</NTHRDS_ROF>
    <NTHRDS_ICE>1</NTHRDS_ICE>
    <NTHRDS_OCN>1</NTHRDS_OCN>
    <NTHRDS_CPL>1</NTHRDS_CPL>
    <NTHRDS_GLC>1</NTHRDS_GLC>
</pes>

A_WCYCL cases use CLUBB but we've done nothing to ensure those runs are single threaded.

What is the origin of this comment? Is it still true?

rljacob commented 8 years ago

Tagging @amametjanov , @wlin7 , @yjinho , @vlarson

wlin7 commented 8 years ago

At least most of the production machines (e.g., Edison, Mira, Titan) have been using threading for compsets that use CLUBB.

worleyph commented 8 years ago

Anyway that we can determine who added that comment (and then, why)?

douglasjacobsen commented 8 years ago

@worleyph:

Here is the detective work:

Looks like it came from a cime merge: https://github.com/ACME-Climate/ACME/blame/master/cime/machines-acme/config_pes.xml#L1147

Which in cime is: https://github.com/ACME-Climate/cime/commit/713acff

Which looks like it came from a copy of the machines file: https://github.com/ACME-Climate/cime/blob/713acff650adc79d9e60bfdbaf4eaef95d54444e/machines-acme/config_pes.xml#L1574

Which looks like it came from Jim Edwards (during the initial setup of cime) https://github.com/ACME-Climate/cime/blame/713acff650adc79d9e60bfdbaf4eaef95d54444e/machines/config_pes.xml#L1574

So, maybe if we ask the NCAR CIME folks, they might know if it's broken and if so why.

worleyph commented 8 years ago

Thanks @douglasjacobsen . @rljacob , looks like this is on your todo list? Unless @douglasjacobsen wants to ask.

douglasjacobsen commented 8 years ago

Just a heads up, I don't see this comment in CESM's version of config_pes.xml for CAM. Not sure if that means it works for them or just they don't know if it works in that version.

worleyph commented 8 years ago

Well, if this came in with the original CIME, it is pretty old, and we have debugged CLUBB since then, so it is probably just an out-of-date comment. Still worth asking to make sure.

douglasjacobsen commented 8 years ago

I asked Jim Edwards, and he said the particular issue that lead to that comment was fixed in a CAM tag about 4-6 months ago. He said he thinks the issue was a race condition.

He also said he thinks that threading working in CLUBB depends on the CLUBB version, and that it's a work in progress. So, I don't know what that means for the version that ACME has.

worleyph commented 8 years ago

This is a question for the atmosphere group then - they need to be talking to the CLUBB developers about what is expected to work and what isn't. I haven't seen any problems with CLUBB in my own runs recently, threaded or not. It would be nice to understand the error signature. @philrasch , is this something that you could assign to someone to track down?

amametjanov commented 8 years ago

I've been running the case -mach edison -compset FC5AV1C-01 -res ne30_oEC, which includes -clubb_sgs in CAM_CONFIG_OPTS, and have been getting BFB results.

@singhbalwinder, could you check if this multi-threading test works for you

If it runs okay, the PE block can probably be removed?

amametjanov commented 8 years ago

Update: the PET_PT_Ln9.ne30_ne30.FC5AV1C-L.edison_intel test passed:

So the comment appears to be out-of-date.

mt5555 commented 8 years ago

Recent CAM bugfix suggests that CLUBB is thread safe, EXCEPT when clubb_history is enabled:

"Reintroduced the check in build-namelist to prohibit running with more than one thread if clubb_history is enabled. This feature does not run properly when more than when thread is used."

singhbalwinder commented 8 years ago

Should I verify if this is the case? I can run a PET_PT_Ln9.ne30_ne30.FC5AV1C-L case with clubb_history on.

mt5555 commented 8 years ago

I thought we should at least document this issue and keep it open for now. Not sure if it is worth investigating right now, unless you often run with clubb_history enabled.

worleyph commented 8 years ago

We could backport this NCAR change, to abort in case someone tries to use this with threading. Seems like a simple way of insuring that we don't forget :-).

singhbalwinder commented 8 years ago

In a recent CAM tag, NCAR did the following for CLUBB: -- Reintroduced the check in build-namelist to prohibit running with more than one thread if clubb_history is enabled. This feature does not run properly when more than when thread is used. -- Removed the ability to turn on clubb_do_deep as it was an unsupported feature

Should I incorporate this change into ACME? I think this will address the concerns raised in this issue.

rljacob commented 8 years ago

Is clubb_history not something we want "on" for all production runs?

singhbalwinder commented 8 years ago

I am not sure about that. @wlin7 , would you please respond to Rob's question?

wlin7 commented 8 years ago

Hi @rljacob , @singhbalwinder , it is for recording useful statistics for the clubb sub-model if wanted to dig deeper into how clubb does the works. But no, it is not something we want 'on' for typical production runs.

singhbalwinder commented 8 years ago

Thanks @wlin7 ! In that case, I think the NCAR fix can serve our needs.