Closed rljacob closed 6 years ago
@amametjanov or @worleyph can one of you look at this? You should be able to run SMS_D_Ln9.f19_g16_rx1.A on any machine with intel.
Looking at the code I do not see an error, but I'll try running an experiment.
On 10/15/17 10:09 PM, Robert Jacob wrote:
@amametjanov https://github.com/amametjanov or @worleyph https://github.com/worleyph can one of you look at this? You should be able to run SMS_D_Ln9.f19_g16_rx1.A on any machine with intel.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MCSclimate/MCT/issues/51#issuecomment-336762122, or mute the thread https://github.com/notifications/unsubscribe-auth/AHghFGvWvAkamG6OX70ad9blOzHTRgDPks5ssrrWgaJpZM4P5_5-.
I haven't run a test for a long time, but
./create_test SMS_D_Ln9.f19_g16_rx1.A --machine titan --compiler intel --project cli115
on Titan did not die with the indicated error. It seems to have run successfully as well, producing timing output and log files as well with an Exit_status=0 and
PASS SMS_D_Ln9.f19_g16_rx1.A.titan_intel RUN time=119
I also tried
./create_test SMS_D_Ln9.f19_g16_rx1.A --machine anvil --compiler intel
on Anvil, and this also ran to completion:
PASS SMS_D_Ln9.f19_g16_rx1.A.anvil_intel RUN time=49
Where would I run to see this error (other than Yellowstone)?
./create_test SMS_D_Ln9.f19_g16_rx1.A --machine cori-knl --compiler intel --project acme --queue debug
also worked on Cori-KNL:
PASS SMS_D_Ln9.f19_g16_rx1.A.cori-knl_intel RUN time=237
as did
./create_test SMS_D_Ln9.f19_g16_rx1.A --machine edison --compiler intel --project acme --queue debug
on Edison:
PASS SMS_D_Ln9.f19_g16_rx1.A.edison_intel RUN time=58
I vaguely remember an error from the lack of RSendBuf allocation in the past, though I can't find e-mail on the topic, but the consistent use of the logic
if(numr .ge. 1) then
corrected the problem in my recollection. This is still in there, and correct as far as I can tell.
Note that I changed the name of the issue. RSENDBUF is "real send buffer" (as distinct from the "integer send buffer"). It has nothing to do with the RSEND protocol.
I see one possible source for the problem.
! IF SENDING DATA
if(SendRout%nprocs > 0) then
...
if(numr .ge. 1) then
...
allocate(RSendBuf(RSendSize),stat=ier)
and
if (usealltoall) then
if (useswapm) then
else
if (numr .ge. 1) then
call MPI_Alltoallv(RSendBuf, RSendCnts, RSdispls, mp_Type_rp, &
RRecvBuf, RRecvCnts, RRdispls, mp_Type_rp, &
ThisMCTWorld%MCT_comm,ier)
so RSendBuf is not allocated if(SendRout%nprocs = 0) yet MPI_Alltoallv is called. Guess that we always use swapm when setting usealltoall in ACME, so our tests are not failing?
I sure thought that this had been identified and fixed awhile back though.
Note that this logic has been in there since CIME2 at least.
Okay, I found it. This was a bug reported by Sheri Mickelson in 2010, and I supplied a fix to Sheri at the time. Apparently it never made it into the model, or another approach was taken instead. Should I resubmit my fix as a PR as ask for a review?
I have no idea why this suddenly showed up as an error now (and not any time in the last 7 years).
Its an MCT problem? Yes submit a PR here.
Okay. This modification was not adopted in 2010. I'll try again.
So, I modified MCT in ACME to force it to set usealltoall to .true. and useswapm to .false. and tried
./create_test SMS_D_Ln9.f19_g16_rx1.A --machine titan --compiler intel --project cli115
again, and it still did not fail. So, I have no idea if the "fix" in PR #52 addresses the problem that @gold2718 observed or not since I cannot reproduce it. Maybe I got lucky and fixed the problem, but I have no way of telling.
Thanks Pat. I think you've done your due diligence. I'll wait to see if @gold2718 can confirm.
reported by @gold2718
There seems to be a bug which has crept into MCT since the last merge into CIME. It seems to be related to Pat Worley's or possibly Azmat's changes. I note that GNU does not pick up this error but Intel does.
The test is:
SMS_D_Ln9.f19_g16_rx1.A.yellowstone_intel
although I also tried this on other machines. The error is: