cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays
Other
199 stars 53 forks source link

Better messages for unexpected exits #15

Closed devinamatthews closed 8 years ago

devinamatthews commented 8 years ago

It would be great if there was some message indicated why CTF is exiting abnormally, something like:

This would make tracking down issues much easier and help to distinguish from genuine segfaults. Also, wouldn't it be better to use MPI_Abort rather than dereference NULL?

solomonik commented 8 years ago

I dereference NULL in asserts when -DDEBUG is on, because this allows one to get a trace via valgrind (or gdb) rather than when the application exits normally or with MPI abort. This is how I always debug and what I would also recommend using valgrind for you. I was not able to find a portable solution for printing a stack trace manually.

When -DDEBUG is not on, when CTF detects an error it does some printf with info and exits gracefully, but there are not nearly as many checks for performance reasons. On Jan 11, 2016 7:09 PM, "Devin Matthews" notifications@github.com wrote:

It would be great if there was some message indicated why CTF is exiting abnormally, something like:

  • "Out of memory" (or "posix_memalign returned XXX")
  • "Assertion failed on line XXX of YYY"

This would make tracking down issues much easier and help to distinguish from genuine segfaults. Also, wouldn't it be better to use MPI_Abort rather than dereference NULL?

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15.

devinamatthews commented 8 years ago

However it exits, something like ASSERT(condition, message) could print the message on failure (and with a macro wrapper automatically give the file and line). For example, I use the following macro in one of my codes:

#define ALWAYS_ASSERT(cond,...) \
do { \
     if (!(cond)) \
     { \
         MASTER \
         { \
             fprintf(stderr, "%s(%d): assertion failed: ", __FILE__, 
__LINE__); \
             if (strcmp("()", STRINGIFY((__VA_ARGS__))) == 0) \
                 fprintf(stderr, STRINGIFY(cond)); \
             else \
                 fprintf(stderr, "" __VA_ARGS__); \
             fprintf(stderr, "\n"); \
         } \
         abort(); \
     } \
} while (0)

which support both ASSERT(cond) and ASSERT(cond,message) styles (note that MASTER is another macro which checks rank==0).

On 1/11/16 12:16 PM, Edgar Solomonik wrote:

I dereference NULL in asserts when -DDEBUG is on, because this allows one to get a trace via valgrind (or gdb) rather than when the application exits normally or with MPI abort. This is how I always debug and what I would also recommend using valgrind for you. I was not able to find a portable solution for printing a stack trace manually.

When -DDEBUG is not on, when CTF detects an error it does some printf with info and exits gracefully, but there are not nearly as many checks for performance reasons. On Jan 11, 2016 7:09 PM, "Devin Matthews" notifications@github.com wrote:

It would be great if there was some message indicated why CTF is exiting abnormally, something like:

  • "Out of memory" (or "posix_memalign returned XXX")
  • "Assertion failed on line XXX of YYY"

This would make tracking down issues much easier and help to distinguish from genuine segfaults. Also, wouldn't it be better to use MPI_Abort rather than dereference NULL?

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170639004.

devinamatthews commented 8 years ago

Apparently Markdown doesn't work from e-mail...

solomonik commented 8 years ago

Sure I'll add a printf, its just that 99% of the time I am debugging I want the full trace anyway, where this information would become redundant. The location of the assert by itself is rarely enough info to debug anything. On Jan 11, 2016 7:20 PM, "Devin Matthews" notifications@github.com wrote:

However it exits, something like ASSERT(condition, message) could print the message on failure (and with a macro wrapper automatically give the file and line). For example, I use the following macro in one of my codes:

#define ALWAYS_ASSERT(cond,...) \
do { \
if (!(cond)) \
{ \
MASTER \
{ \
fprintf(stderr, "%s(%d): assertion failed: ", __FILE__,
__LINE__); \
if (strcmp("()", STRINGIFY((__VA_ARGS__))) == 0) \
fprintf(stderr, STRINGIFY(cond)); \
else \
fprintf(stderr, "" __VA_ARGS__); \
fprintf(stderr, "\n"); \
} \
abort(); \
} \
} while (0)

which support both ASSERT(cond) and ASSERT(cond,message) styles (note that MASTER is another macro which checks rank==0).

On 1/11/16 12:16 PM, Edgar Solomonik wrote:

I dereference NULL in asserts when -DDEBUG is on, because this allows one to get a trace via valgrind (or gdb) rather than when the application exits normally or with MPI abort. This is how I always debug and what I would also recommend using valgrind for you. I was not able to find a portable solution for printing a stack trace manually.

When -DDEBUG is not on, when CTF detects an error it does some printf with info and exits gracefully, but there are not nearly as many checks for performance reasons. On Jan 11, 2016 7:09 PM, "Devin Matthews" notifications@github.com wrote:

It would be great if there was some message indicated why CTF is exiting abnormally, something like:

  • "Out of memory" (or "posix_memalign returned XXX")
  • "Assertion failed on line XXX of YYY"

This would make tracking down issues much easier and help to distinguish from genuine segfaults. Also, wouldn't it be better to use MPI_Abort rather than dereference NULL?

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170639004.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170640575.

devinamatthews commented 8 years ago

It helps a lot when you're not running in a debugger the first time or when the stack is corrupted. Also, I think a lot of these asserts could be left on without much (any?) performance penalty. Doing so, or having a mode to turn on assert but not DEBUG printf's would be helpful in catching errors like I had recently.

On 1/11/16 12:24 PM, Edgar Solomonik wrote:

Sure I'll add a printf, its just that 99% of the time I am debugging I want the full trace anyway, where this information would become redundant. The location of the assert by itself is rarely enough info to debug anything. On Jan 11, 2016 7:20 PM, "Devin Matthews" notifications@github.com wrote:

However it exits, something like ASSERT(condition, message) could print the message on failure (and with a macro wrapper automatically give the file and line). For example, I use the following macro in one of my codes:

#define ALWAYS_ASSERT(cond,...) \
do { \
if (!(cond)) \
{ \
MASTER \
{ \
fprintf(stderr, "%s(%d): assertion failed: ", __FILE__,
__LINE__); \
if (strcmp("()", STRINGIFY((__VA_ARGS__))) == 0) \
fprintf(stderr, STRINGIFY(cond)); \
else \
fprintf(stderr, "" __VA_ARGS__); \
fprintf(stderr, "\n"); \
} \
abort(); \
} \
} while (0)

which support both ASSERT(cond) and ASSERT(cond,message) styles (note that MASTER is another macro which checks rank==0).

On 1/11/16 12:16 PM, Edgar Solomonik wrote:

I dereference NULL in asserts when -DDEBUG is on, because this allows one to get a trace via valgrind (or gdb) rather than when the application exits normally or with MPI abort. This is how I always debug and what I would also recommend using valgrind for you. I was not able to find a portable solution for printing a stack trace manually.

When -DDEBUG is not on, when CTF detects an error it does some printf with info and exits gracefully, but there are not nearly as many checks for performance reasons. On Jan 11, 2016 7:09 PM, "Devin Matthews" notifications@github.com wrote:

It would be great if there was some message indicated why CTF is exiting abnormally, something like:

  • "Out of memory" (or "posix_memalign returned XXX")
  • "Assertion failed on line XXX of YYY"

This would make tracking down issues much easier and help to distinguish from genuine segfaults. Also, wouldn't it be better to use MPI_Abort rather than dereference NULL?

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170639004.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170640575.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170642318.

jeffhammond commented 8 years ago

That is correct. You can't even clean it up later (I've tried).

Sent from my iPhone

On Jan 11, 2016, at 10:24 AM, Devin Matthews notifications@github.com wrote:

Apparently Markdown doesn't work from e-mail...

— Reply to this email directly or view it on GitHub.

jeffhammond commented 8 years ago

We should aspire to support backtrack under Linux at least. I think it's feasible. Will investigate later.

Blue Gene doesn't support Valgrind.

Sent from my iPhone

On Jan 11, 2016, at 10:32 AM, Devin Matthews notifications@github.com wrote:

It helps a lot when you're not running in a debugger the first time or when the stack is corrupted. Also, I think a lot of these asserts could be left on without much (any?) performance penalty. Doing so, or having a mode to turn on assert but not DEBUG printf's would be helpful in catching errors like I had recently.

On 1/11/16 12:24 PM, Edgar Solomonik wrote:

Sure I'll add a printf, its just that 99% of the time I am debugging I want the full trace anyway, where this information would become redundant. The location of the assert by itself is rarely enough info to debug anything. On Jan 11, 2016 7:20 PM, "Devin Matthews" notifications@github.com wrote:

However it exits, something like ASSERT(condition, message) could print the message on failure (and with a macro wrapper automatically give the file and line). For example, I use the following macro in one of my codes:

#define ALWAYS_ASSERT(cond,...) \
do { \
if (!(cond)) \
{ \
MASTER \
{ \
fprintf(stderr, "%s(%d): assertion failed: ", __FILE__,
__LINE__); \
if (strcmp("()", STRINGIFY((__VA_ARGS__))) == 0) \
fprintf(stderr, STRINGIFY(cond)); \
else \
fprintf(stderr, "" __VA_ARGS__); \
fprintf(stderr, "\n"); \
} \
abort(); \
} \
} while (0)

which support both ASSERT(cond) and ASSERT(cond,message) styles (note that MASTER is another macro which checks rank==0).

On 1/11/16 12:16 PM, Edgar Solomonik wrote:

I dereference NULL in asserts when -DDEBUG is on, because this allows one to get a trace via valgrind (or gdb) rather than when the application exits normally or with MPI abort. This is how I always debug and what I would also recommend using valgrind for you. I was not able to find a portable solution for printing a stack trace manually.

When -DDEBUG is not on, when CTF detects an error it does some printf with info and exits gracefully, but there are not nearly as many checks for performance reasons. On Jan 11, 2016 7:09 PM, "Devin Matthews" notifications@github.com wrote:

It would be great if there was some message indicated why CTF is exiting abnormally, something like:

  • "Out of memory" (or "posix_memalign returned XXX")
  • "Assertion failed on line XXX of YYY"

This would make tracking down issues much easier and help to distinguish from genuine segfaults. Also, wouldn't it be better to use MPI_Abort rather than dereference NULL?

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170639004.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170640575.

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170642318.

— Reply to this email directly or view it on GitHub.

devinamatthews commented 8 years ago

On that note, libunwind is a great tool for this, but is there any library equivalent of addr2line (beyond calling it in a shell)? libbfd looks crazy hard to use.

jeffhammond commented 8 years ago

We should create a new issue for backtrace support and leave it as a long-term, low-priority feature request. I'll try to look into it some day, but I suspect I will have three kids before it happens ;-)

solomonik commented 8 years ago

I have the below stack trace print thing,

https://github.com/solomonik/ctf/blob/master/src/interface/common.cxx#L156

which used to get called in ASSERT, but I stopped using it, because it does not work anywhere except simple linux boxes where you can always run valgrind anyway.

Also at the moment, CTF does not really support BG/Q anymore.

On Mon, Jan 11, 2016 at 8:33 PM, Jeff Hammond notifications@github.com wrote:

We should create a new issue for backtrace support and leave it as a long-term, low-priority feature request. I'll try to look into it some day, but I suspect I will have three kids before it happens ;-)

— Reply to this email directly or view it on GitHub https://github.com/solomonik/ctf/issues/15#issuecomment-170664890.

solomonik commented 8 years ago

Ok I reactivated the stack trace printing as well as added a print message that shows what and where the ASSERT is. However, this info is only printed on processor 0, which means even if all processors fail there is no guarantee that the error will be printed. This is one of the reasons I had taken this stuff out some time ago. Printing errors on all processors becomes problematic when you start running on a few thousand of them.

Generally -DDEBUG and ASSERT() functionality is designed for people changing CTF and wanting to debug the resulting code, not for users of CTF. Indeed when testing new functionality on top of CTF, they can help detect CTF bugs, but I think cases like the highly intricate error you ran into when running CCSDTQ in parallel are quite exceptional.