cornelisnetworks / opa-psm2


crash when sending very large messages #23

Closed mattijsjanssens closed 6 years ago

mattijsjanssens commented 6 years ago

We're occasionally seeing an assertion failure of the form

ips_proto.c:1646: (scb->payload_size & 0x3) == 0

which seems to originate from somewhere in the network stack (e.g. https://github.com/intel/opa-psm2/blob/master/ptl_ips/ips_proto.c#L1957) when the message size is not a multiple of 4 bytes.

Is this a known problem? We don't pad our MPI messages to a multiple of 4 bytes. Should we? If so, why does it not show up in ordinary usage (i.e. with smaller messages)?
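For readers unfamiliar with the expression in the assert: `(x & 0x3) == 0` holds exactly when `x` is a multiple of 4, because the two lowest bits of such a number are zero. The following is a minimal sketch of that check and of rounding a size up to the next 4-byte boundary. The helper names are hypothetical and are not part of PSM2; the sketch only illustrates the arithmetic and does not imply that applications are expected to pad their messages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PSM2 function): returns 1 when size is a
 * multiple of 4 bytes, i.e. when its two lowest bits are both zero. */
static inline int is_dword_multiple(size_t size)
{
    return (size & 0x3) == 0;
}

/* Hypothetical helper (not a PSM2 function): rounds size up to the next
 * multiple of 4 bytes; sizes that are already aligned are unchanged. */
static inline size_t pad_to_dword(size_t size)
{
    return (size + 3) & ~(size_t)0x3;
}

/* Mirrors the shape of the reported assertion for an arbitrary size:
 * aborts the program when size is not a multiple of 4 bytes. */
static inline void assert_dword_multiple(size_t size)
{
    assert(is_dword_multiple(size));
}
```

For example, a 4097-byte size fails the check, while `pad_to_dword(4097)` yields 4100, which passes it.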

aravindksg commented 6 years ago

We have seen this problem before with message sizes that are not a DW (dword, 4-byte) multiple, but the issue was fixed as of PSM2 version PSM2_10.2-235. Also, the line numbers you posted above do not match: your execution is failing at ips_proto.c:1646, whereas the assert in the latest PSM2 master is at ips_proto.c:1957. Could you clarify whether you are actually using the latest PSM2 version from GitHub or a different PSM2 version (either from a distro or from IFS)? If it is indeed an older version, could you please update to the latest GitHub master and retry?

mattijsjanssens commented 6 years ago

Thanks for the answer. I will check.

rwmcguir commented 6 years ago

Can this issue be closed or is this still a problem?

mattijsjanssens commented 6 years ago

From what I'm told it can probably be closed. Many thanks for the feedback.

Mattijs

On 26 April 2018 at 17:25, Russell McGuire notifications@github.com wrote:

Can this issue be closed or is this still a problem?


rwmcguir commented 6 years ago

Thank you for confirming.