charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

verbs crash with bad packet length on stampede #803

Closed trquinn closed 9 years ago

trquinn commented 9 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/803


The latest charm version (v6.6.0-317-g80fea48) crashes a ChaNGa run on stampede with:

Fatal error on PE 502> packet in the middle does not have expected length

An earlier version (v6.6.0-258-ge8e17df) works.

nikhil-jain commented 5 years ago

Original date: 2015-08-12 02:49:45


Bilge, please judge whether it's related to the broadcast change, and if not, reassign to the ChaNGa group.

bilgeacun commented 5 years ago

Original date: 2015-08-12 14:21:40


It could be related to my recent broadcast change. (b1db2f25a931534c555aadd14740f8f6831bf9ae)

Tom, can you tell me the parameters you're using for ChaNGa when you get this crash? I'll try to reproduce the issue.

trquinn commented 5 years ago

Original date: 2015-08-12 19:02:13


I used "git bisect" to figure out which commit caused the problem. It tells me:

login3.stampede(17)$ git bisect bad
b1db2f25a931534c555aadd14740f8f6831bf9ae is the first bad commit
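For readers unfamiliar with the workflow, here is a minimal, self-contained sketch of how "git bisect" narrows a regression down to one commit. It is not the session run on Stampede: the throwaway repository, the file name state.txt, and the "BROKEN" marker are all made up for illustration, and the automated check stands in for rebuilding charm and rerunning ChaNGa at each step.

```shell
#!/bin/sh
# Toy bisect: 8 commits in a scratch repo, with a fake "bug" introduced at commit 5.
# Everything here (repo, file name, marker string) is hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

i=1
while [ "$i" -le 8 ]; do
    if [ "$i" -ge 5 ]; then
        echo "BROKEN $i" > state.txt   # commits 5..8 carry the fake bug
    else
        echo "ok $i" > state.txt       # commits 1..4 are fine
    fi
    git add state.txt
    git commit -q -m "commit $i"
    i=$((i + 1))
done

expected=$(git log --format=%H --grep='^commit 5$')
good=$(git log --format=%H --grep='^commit 1$')

# Mark HEAD as bad and commit 1 as good, then let git drive the search.
# The test command must exit 0 for a good commit and non-zero for a bad one.
git bisect start HEAD "$good" >/dev/null
git bisect run sh -c '! grep -q BROKEN state.txt' >/dev/null

# While the bisect session is open, refs/bisect/bad names the first bad commit.
first_bad=$(git rev-parse refs/bisect/bad)
echo "first bad commit: $(git log -1 --format=%s "$first_bad")"
git bisect reset >/dev/null 2>&1
```

In a charm checkout, the endpoints would presumably be the two versions from the original report (bad: v6.6.0-317-g80fea48, good: v6.6.0-258-ge8e17df). Each step would then be marked by hand with "git bisect good" or "git bisect bad" after rebuilding and rerunning the job, since a cluster run is hard to script as a single exit code.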

To reproduce on Stampede:

1) In charm, "./build ChaNGa verbs-linux-x86_64 smp -j4 -O2"
2) In ChaNGa, "./configure --enable-bigkeys; make"
3) Run ChaNGa with the attached job script and param file.

The data file is too large to attach, but can be found at: ftp://ftp-hpcc.astro.washington.edu/pub/hpcc/bench/hrwh_sbc_gas.tbin

The files are also on Stampede in the directory /home1/00333/tg456090/work/hrwh_sbc_g

HRWH_sbc_g.param hrwh.qsub

bilgeacun commented 5 years ago

Original date: 2015-08-17 19:59:45


Thanks Tom, so it is my recent change that is causing the problem. I have reproduced the crash and am working on a fix.

trquinn commented 5 years ago

Original date: 2015-09-10 22:01:07


Any progress on this? Having charm broken on the ibverbs platform is not good.

bilgeacun commented 5 years ago

Original date: 2015-09-11 13:35:12


I'm going to look into this after my paper deadline this weekend.

bilgeacun commented 5 years ago

Original date: 2015-09-14 17:56:25


The fix is implemented here:

https://charm.cs.illinois.edu/gerrit/#/c/827/
https://github.com/UIUC-PPL/charm/commit/7f5f80087cdd1e4fb075e274bfe629357cbf9364

I've tested it and ChaNGa works fine now. Tom, can you test it as well, please?

bilgeacun commented 5 years ago

Original date: 2015-09-17 19:48:15


The fix is merged.

PhilMiller commented 5 years ago

Original date: 2015-11-11 03:05:23


Set status back to Merged so that we can distinguish whether the code was changed in some way, or the fix was elsewhere. Both are non-open states from Redmine's perspective.