darshan-hpc / darshan

Darshan I/O characterization tool
Other
56 stars 27 forks source link

BUG, MAINT: log with negative heatmap bin values (from Slack) #943

Open tylerjereddy opened 1 year ago

tylerjereddy commented 1 year ago

This log was provided by Nafise Moti on Slack, with debug prints in the darshan/datatypes/heatmap.py code suggesting we get negative heatmap bin values with print(rec['write_bins']). moti_nek5000.tar.gz

I asked to see if they could regenerate the log with the latest darshan-runtime, in case it is simply an issue with the older runtime. If it is reproducible with the newer runtime, then it may be interesting to diagnose on the runtime side.

However, the initial log above is probably enough for us to start studying for evidence of i.e., overflow to produce large negative numbers, and to see if the darshan-parser agrees with PyDarshan bindings perhaps.

carns commented 1 year ago

It looks like there may be another problem with this log (either something orthogonal, or possibly the root cause of record corruption elsewhere). My first attempt to parse the log manually using a current origin/main build produced the following:

carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEAT
# HEATMAP module: 172850 bytes (compressed), ver=1
AddressSanitizer:DEADLYSIGNAL
=================================================================
==625539==ERROR: AddressSanitizer: SEGV on unknown address 0x632000050000 (pc 0x7fc67814bda6 bp 0x63200004ffff sp 0x7ffdae52b858 T0)
==625539==The signal is caused by a WRITE memory access.
    #0 0x7fc67814bda6  (/lib/x86_64-linux-gnu/libz.so.1+0x3da6) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
    #1 0x7fc6781577de in inflate (/lib/x86_64-linux-gnu/libz.so.1+0xf7de) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
    #2 0x7fc6781a9b2f in darshan_log_libz_read ../../darshan-util/darshan-logutils.c:1681
    #3 0x7fc6781a934c in darshan_log_dzread ../../darshan-util/darshan-logutils.c:1567
    #4 0x7fc67819ff5b in darshan_log_get_mod ../../darshan-util/darshan-logutils.c:789
    #5 0x7fc6781d22b8 in darshan_log_get_lustre_record ../../darshan-util/darshan-lustre-logutils.c:95
    #6 0x557a1d9ad4a4 in main ../../darshan-util/darshan-parser.c:395
    #7 0x7fc677023a8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0x7fc677023b48 in __libc_start_main_impl ../csu/libc-start.c:360
    #9 0x557a1d9ab604 in _start (/home/carns/working/install/bin/darshan-parser+0x4604) (BuildId: 56758761c6dea4840907f84f13af5d89cc9b1f3e)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libz.so.1+0x3da6) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11) 
==625539==ABORTING

My Darshan build is using gcc 12.2 and address sanitizer. Address sanitizer is triggering a segfault while darshan-parser is extracting Lustre module data, which happens before it gets to the heatmap module.

I can't validate yet, but it's possible that memory corruption is leading to the bogus heatmap data. I'll try hacking up the parser to extract only the heatmap data and see what it looks like.

carns commented 1 year ago

Ok, that was a red herring. When I hacked up darshan-parser to skip straight to the heatmap module I definitely see negative values:

carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEATMAP |grep -v "^#" | cut -f 5 |sort -n |uniq -c |head -n 20
      8 -9223372036854775808
      4 -9223372036854774784
      7 -9223372036854773760
      1 -9223372036854772736
      3 -9223372036854771712
      3 -9223372036854769664
      1 -9223372036854767616
      1 -9223372036854765568
      1 -9223372036854764544
      2 -9223372036854762496
      1 -9223372036854761472
      1 -9223372036854758400
 527058 0
   2049 0.400000
      1 11
      1 26
      1 29
      1 32
      1 74
      1 130

It looks like a little over 30 bins have negative values.

carns commented 1 year ago

I tried this little test program to see what the max value of an int64 counter is:

carns-x1-7g ~/tmp> cat foo.c                                      e: darshan-dev
#include <stdint.h>
#include <limits.h>
#include <stdio.h>

int main(int argc, char **argv) {
    printf("int64 max: %ld\n", INT64_MAX);
}

carns-x1-7g ~/tmp> ./foo                                          e: darshan-dev
int64 max: 9223372036854775807

The negative values here look like they are very close to the most negative values possible on a signed 64 bit integer. It's impossible for the application to have accessed enough data to simply wrap around though. This has to be a bug in Darshan or some sort of corner case in the function arguments we are tracking.

carns commented 1 year ago

Another piece of the puzzle: all of these faulty values are in the stdio heatmap:

> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEATMAP |grep -v "^#" | grep "-"
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_1 -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_110   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_112   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_113   -9223372036854769664    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_114   -9223372036854765568    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_115   -9223372036854758400    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_116   -9223372036854764544    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_117   -9223372036854774784    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_118   -9223372036854771712    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_119   -9223372036854761472    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_120   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_121   -9223372036854769664    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_122   -9223372036854771712    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_123   -9223372036854762496    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_124   -9223372036854767616    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_125   -9223372036854774784    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_126   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_127   -9223372036854772736    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_128   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_129   -9223372036854769664    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_130   -9223372036854762496    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_131   -9223372036854774784    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_132   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_133   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_134   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_135   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_136   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_137   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_138   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_139   -9223372036854773760    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_140   -9223372036854774784    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_141   -9223372036854771712    heatmap:STDIO   UNKNOWN UNKNOWN
HEATMAP 0   3989511027826779520 HEATMAP_WRITE_BIN_142   -9223372036854775808    heatmap:STDIO   UNKNOWN UNKNOWN
carns commented 1 year ago

I think we might need a runtime reproducer to track this down. I'd like to see what the arguments look like on the way in to the heatmap_update() calls. Feels like a bad cast somewhere.

carns commented 1 year ago

Nafiseh also provided a log from a different run that does not exhibit this problem (this is again with the parser hacked to only show the heatmap data):

carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11407458-51659_8-29-42872-3372088521547566391_5.darshan |grep -v "^#" | grep "-"
carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06 [0|0|1]> 

So this is possibly an application problem, but it would still be good to know if there was something we could have done to prevent bogus data from getting to the log or if it was some sort of uncontrollable problem (like a memory corruption).