MCSclimate / MCT

Model Coupling Toolkit

Possible bug in use of RSENDBUF in Rearranger #51

Closed: rljacob closed this issue 6 years ago

rljacob commented 7 years ago

Reported by @gold2718:

There seems to be a bug that has crept into MCT since the last merge into CIME. It seems to be related to Pat Worley's or possibly Azmat's changes. I note that GNU does not pick up this error, but Intel does.

The test is SMS_D_Ln9.f19_g16_rx1.A.yellowstone_intel, although I also tried this on other machines. The error is:

  14:forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable
 RSENDBUF when it is not allocated
  14:
  14:Image              PC                Routine            Line        Source 

  14:cesm.exe           00000000017E39A0  Unknown               Unknown  Unknown
  14:cesm.exe           00000000012F1BA2  m_rearranger_mp_r        1165  m_Rearranger.F90
  14:cesm.exe           00000000012A9FFD  m_matattrvectmul_         514  m_MatAttrVectMul.F90
  14:cesm.exe           0000000000D5F4DE  shr_strdata_mod_m         754  shr_strdata_mod.F90
  14:cesm.exe           00000000008C4AE7  datm_comp_mod_mp_         667  datm_comp_mod.F90
  14:cesm.exe           00000000008C41AE  datm_comp_mod_mp_         578  datm_comp_mod.F90
  14:cesm.exe           00000000008B9525  atm_comp_mct_mp_a         169  atm_comp_mct.F90
  14:cesm.exe           0000000000483F4D  component_mod_mp_         229  component_mod.F90
  14:cesm.exe           000000000044B2C8  cime_comp_mod_mp_        1188  cime_comp_mod.F90
  14:cesm.exe           0000000000478E10  MAIN__                     63  cime_driver.F90
  14:cesm.exe           0000000000436F9E  Unknown               Unknown  Unknown
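
For context: severe (408) is Intel's runtime pointer-checking diagnostic, enabled here by the debug (_D) build of the test. Passing an unallocated allocatable as an actual argument is invalid Fortran, and ifort traps it at the call site even if the callee never reads the buffer; gfortran performs no such check by default, which is consistent with GNU not picking this up. A minimal standalone sketch of this class of error (illustrative only, not MCT code):

 ! Illustrative reproducer, not MCT code. Built with "ifort -check pointers"
 ! (or -check all), the call below should abort with
 ! "forrtl: severe (408) ... Attempt to fetch from allocatable variable
 ! RSENDBUF when it is not allocated"; gfortran typically runs silently,
 ! since buf is never actually read when n is 0.
 program unalloc_demo
   implicit none
   real, allocatable :: rsendbuf(:)
   call consume(rsendbuf, 0)        ! rsendbuf was never allocated
 contains
   subroutine consume(buf, n)
     integer, intent(in) :: n
     real,    intent(in) :: buf(*)  ! assumed-size dummy, like an MPI buffer
     if (n > 0) print *, buf(1)
   end subroutine consume
 end program unalloc_demo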
rljacob commented 7 years ago

@amametjanov or @worleyph can one of you look at this? You should be able to run SMS_D_Ln9.f19_g16_rx1.A on any machine with intel.

worleyph commented 7 years ago

Looking at the code, I do not see an error, but I'll try running an experiment.

On 10/15/17 10:09 PM, Robert Jacob wrote:

@amametjanov or @worleyph can one of you look at this? You should be able to run SMS_D_Ln9.f19_g16_rx1.A on any machine with intel.


worleyph commented 7 years ago

I haven't run a test for a long time, but

 ./create_test SMS_D_Ln9.f19_g16_rx1.A --machine titan --compiler intel --project cli115

on Titan did not die with the indicated error. It seems to have run successfully, producing timing output and log files, with Exit_status=0 and

 PASS SMS_D_Ln9.f19_g16_rx1.A.titan_intel RUN time=119

I also tried

 ./create_test SMS_D_Ln9.f19_g16_rx1.A --machine anvil --compiler intel

on Anvil, and this also ran to completion:

 PASS SMS_D_Ln9.f19_g16_rx1.A.anvil_intel RUN time=49

Where would I run to see this error (other than Yellowstone)?

worleyph commented 7 years ago

 ./create_test SMS_D_Ln9.f19_g16_rx1.A --machine cori-knl --compiler intel --project acme --queue debug

also worked on Cori-KNL:

 PASS SMS_D_Ln9.f19_g16_rx1.A.cori-knl_intel RUN time=237

as did

 ./create_test SMS_D_Ln9.f19_g16_rx1.A --machine edison  --compiler intel --project acme --queue debug

on Edison:

 PASS SMS_D_Ln9.f19_g16_rx1.A.edison_intel RUN time=58

I vaguely remember an error from the lack of RSendBuf allocation in the past, though I can't find e-mail on the topic. As I recall, the consistent use of the guard

 if (numr .ge. 1) then

corrected the problem. That guard is still in there, and correct as far as I can tell.

worleyph commented 7 years ago

Note that I changed the name of the issue. RSENDBUF is "real send buffer" (as distinct from the "integer send buffer"). It has nothing to do with the RSEND protocol.

worleyph commented 7 years ago

I see one possible source for the problem.

 ! IF SENDING DATA
 if(SendRout%nprocs > 0) then
 ...
    if(numr .ge. 1) then
    ...
       allocate(RSendBuf(RSendSize),stat=ier)

and

 if (usealltoall) then
   if (useswapm) then
     ...
   else
     if (numr .ge. 1) then
        call MPI_Alltoallv(RSendBuf, RSendCnts, RSdispls, mp_Type_rp, &
                 RRecvBuf, RRecvCnts, RRdispls, mp_Type_rp, &
                 ThisMCTWorld%MCT_comm,ier)

so RSendBuf is not allocated when SendRout%nprocs == 0, yet MPI_Alltoallv is still called with it. I guess that we always use swapm when setting usealltoall in ACME, so our tests are not failing?
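
One way to close that hole would be to give RSendBuf a defined allocation on every rank before the call, since Fortran forbids passing an unallocated allocatable actual argument even when every entry of RSendCnts is zero. A sketch only, assuming RSendSize is defined (possibly zero) on non-sending ranks:

 ! Sketch only: a size-zero allocation is legal Fortran, and an allocated
 ! zero-size array is a valid actual argument, so the debug-mode pointer
 ! check can no longer fire on ranks that send nothing.
 if (.not. allocated(RSendBuf)) then
    allocate(RSendBuf(RSendSize), stat=ier)
 endif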

I sure thought that this had been identified and fixed awhile back though.

worleyph commented 7 years ago

Note that this logic has been in there since CIME2 at least.

worleyph commented 7 years ago

Okay, I found it. This was a bug reported by Sheri Mickelson in 2010, and I supplied a fix to Sheri at the time. Apparently it never made it into the model, or another approach was taken instead. Should I resubmit my fix as a PR and ask for a review?

I have no idea why this suddenly showed up as an error now (and not any time in the last 7 years).

rljacob commented 7 years ago

It's an MCT problem? Yes, submit a PR here.

worleyph commented 7 years ago

Okay. This modification was not adopted in 2010. I'll try again.

worleyph commented 7 years ago

So, I modified MCT in ACME to force usealltoall to .true. and useswapm to .false., and tried

 ./create_test SMS_D_Ln9.f19_g16_rx1.A --machine titan --compiler intel --project cli115

again, and it still did not fail. So I have no idea whether the "fix" in PR #52 addresses the problem that @gold2718 observed, since I cannot reproduce it. Maybe I got lucky and fixed the problem, but I have no way of telling.
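
For reference, the forcing was just a hard-coded override of the path-selection flags along these lines (a sketch, not the exact diff):

 ! Sketch: hard-code the path selection so the unguarded MPI_Alltoallv
 ! branch is always exercised, regardless of run-time configuration.
 usealltoall = .true.
 useswapm    = .false.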

rljacob commented 7 years ago

Thanks Pat. I think you've done your due diligence. I'll wait to see if @gold2718 can confirm.