charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

In partitioned runs, abort message should include partition ID #400

Closed: PhilMiller closed this issue 8 years ago

PhilMiller commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/400


Right now, when some code calls CkAbort or CmiAbort, the resulting message on standard error includes the PE making the call. In a partitioned run, there are many PEs that share the same number, one in each partition. In that case, the message should include the partition number as well, to remove ambiguity.

Per Jim:

I know it's a little late in the 6.6 cycle for this, but this is a major usability issue with no userland workaround.

Testing the MIC code on Stampede with a 2048-node run, one partition per node, I get the following in stderr:

offload error: process on the device 0 was unexpectedly terminated
------------- Processor 8 Exiting: Called CmiAbort ------------
Reason: unexpected call to exit by user program. Must use CkExit, not exit!
Fatal error on PE 8> unexpected call to exit by user program. Must use CkExit, not exit!

This is of course completely useless since I don't know which partition generated the error. The NAMD_die messages are much more useful:

Reason: REPLICA 1283 FATAL ERROR: MIC error on Pe 2 (c510-302.stampede.tacc.utexas.edu): No MIC devices found.

The replica index is only printed when CmiNumPartitions() > 1 and is only printed on stderr. A normal message is also printed on stdout.

The changes would need to be in charmrun_abort(), LrtsAbort(), and KillOnAllSigs(), and of course only for the lrts runtimes (pamilrts, netlrts, verbs, and mpi).
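The requested behavior can be sketched in C as below. This is only an illustration under stated assumptions, not the code that was merged: `format_abort_origin` and `report_abort` are hypothetical names, and the message layout is invented for the example. In the real runtime the PE, partition index, and partition count would presumably come from CmiMyPe(), CmiMyPartition(), and CmiNumPartitions(); they are plain parameters here so the sketch compiles without Charm++.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of Charm++: builds the "who aborted" part
 * of an abort message. The arguments stand in for values the runtime would
 * presumably obtain from CmiMyPe(), CmiMyPartition() and CmiNumPartitions().
 * Returns the value of snprintf (the length the text needs).
 */
int format_abort_origin(char *buf, size_t len,
                        int pe, int partition, int num_partitions)
{
    if (num_partitions > 1) {
        /* Partitioned run: PE numbers repeat across partitions,
         * so the partition index is needed to identify the caller. */
        return snprintf(buf, len, "Partition %d, PE %d", partition, pe);
    }
    /* Single-partition run: the PE number alone is unambiguous. */
    return snprintf(buf, len, "PE %d", pe);
}

/* Hypothetical wrapper: writes one abort line to the given stream
 * (stderr in practice). The layout is illustrative only. */
void report_abort(FILE *out, int pe, int partition,
                  int num_partitions, const char *reason)
{
    char origin[64];
    format_abort_origin(origin, sizeof origin, pe, partition, num_partitions);
    fprintf(out, "Fatal error on %s> %s\n", origin, reason);
}
```

Calling `report_abort(stderr, 8, 1283, 2048, "...")` would tag the line with both the partition and the PE, while a run with a single partition keeps the existing PE-only form, matching the condition NAMD uses for its replica index.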

PhilMiller commented 5 years ago

Original date: 2014-01-23 20:29:18


I'll do some refactoring to try to get consistency across machine layers in the process.

PhilMiller commented 5 years ago

Original date: 2014-01-24 17:57:38


Implemented and merged for CmiAbort. Not yet done for charmrun_abort or KillOnAllSigs.

PhilMiller commented 5 years ago

Original date: 2014-01-30 19:36:20


Remaining subtasks (charmrun_abort and KillOnAllSigs) deferred.

PhilMiller commented 5 years ago

Original date: 2016-02-18 04:34:45


All subtasks had implementations merged before 6.7.0 was released!