darshan-hpc / darshan

Darshan I/O characterization tool
Other
56 stars 28 forks source link

Darshan-util fails with poorly formatted mount table #995

Closed bellenlau closed 3 months ago

bellenlau commented 4 months ago

Hello,

darhsan-util clis fail with the following message, starting from darshan-parser

Error: poorly formatted mount table in darshan log file.

I am instrumenting an MPI application using intel-oneapi-mpi/2021.10.0 with darshan-runtime/3.4.5. The logfile is generated without error and the darshan-parser works on logfiles generated by instrumenting the same application using openmpi.

Thank you for your help,

Laura

shanedsnyder commented 4 months ago

Can you comment on how this relates to the other issue you opened (#994)? You mention in the other issue that some applications work fine with Darshan, but perhaps you ultimately ran into this issue when parsing? Or is this an entirely different application than the ones mentioned in the original issue?

bellenlau commented 4 months ago

Hello, this does not realte to the other issue. I am testing two different installations and applications. The software stack is always based on intel-oneapi-mpi, but in the case of this issue I installed darshan/3.4.5 manually and the application tested is a toy code doing some MPI-IO, not a scientific application. In the case of this issue, theres should be no mismatch between the mpi used to configure darshan-runtime and to compile the miniapp.

After further investigation, I noticed that the issue arise when I run the program on two different partitions of the cluster. The simulations are launched from the same area of the cluster, but run in compute nodes from different partitions. Any suggestion regarding how this could affect the measurement?

shanedsnyder commented 4 months ago

Okay, thanks for the details.

I can't think of anything obvious after having a closer look at the code that writes/reads that portion of the log file you're getting errors about (the mount table). That portion of the log file is a fixed length, so I was initially wondering if the error is somehow related to this fixed-length memory being exhausted. I hacked Darshan to use a much smaller size mount table log region, but it doesn't seem to ever give me any errors like you're getting.

Is it possible to share the output of the mount command on the nodes that Darshan gives this error on? I don't know how it would happen, but maybe it's formatted in a way that confuses Darshan.

bellenlau commented 4 months ago

Yes, please find the output attached. I can share also the darshan logfile if needed.

/dev/mapper/live-rw on / type ext4 (rw,relatime)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
none on /sys/kernel/tracing type tracefs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=263690608k,nr_inodes=65922652,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
mqueue on /dev/mqueue type mqueue (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=40,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=13134)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=5242880k)
/dev/nvme0n1p1 on /nvme type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
10.128.82.1@o2ib:10.128.82.2@o2ib:/larchive/work on /leonardo_work type lustre (rw,flock,lazystatfs,noencrypt)
10.128.80.1@o2ib:10.128.80.2@o2ib:/lhomefs/leonardo on /leonardo type lustre (rw,flock,lazystatfs,noencrypt)
10.128.84.1@o2ib:10.128.84.2@o2ib:/lscratch/leonardo on /leonardo_scratch type lustre (rw,flock,lazystatfs,noencrypt)
10.128.80.1@o2ib:10.128.80.2@o2ib:/lhomefs/leonardo on /opt/cuda type lustre (rw,flock,lazystatfs,noencrypt)
10.128.80.1@o2ib:10.128.80.2@o2ib:/lhomefs/leonardo on /opt/intel type lustre (rw,flock,lazystatfs,noencrypt)
10.128.80.1@o2ib:10.128.80.2@o2ib:/lhomefs/leonardo on /etc/nhc type lustre (rw,flock,lazystatfs,noencrypt)
10.128.80.1@o2ib:10.128.80.2@o2ib:/lhomefs/leonardo on /home type lustre (rw,flock,lazystatfs,noencrypt)
/dev/nvme0n1p1 on /tmp type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
tmpfs on /dev/shm type tmpfs (rw,relatime)
shanedsnyder commented 4 months ago

Yeah, if you don't mind sharing the log file, that would be helpful, too. I don't see anything obvious in the above mount output, but maybe stepping through how it's parsing this info in a debugger will help me find out what's going on.

bellenlau commented 4 months ago

The logfile is attached, thank you for your help. darsh-s8000000000-np16.log

bellenlau commented 3 months ago

I add this piece of information: the report can be generated with Pydarshan; the same error pops up but the html does not seem to present issues.

Thank you,

Laura

carns commented 3 months ago

I found some confusing behavior when trying to reproduce this just now.

If I build Darshan by hand (from origin/main, though I don't think the version is really a factor) then I cannot trigger the problem:

carns-x1-11g ~/w/d/darshan-issue-995-mount-table> ~/working/install/bin/darshan-parser darsh-s8000000000-np16.darshan |head
# darshan log version: 3.41
# compression method: ZLIB
# exe: ./misterio.exe -n misterio.out -s 8000000000
# uid: 33678
# jobid: 1099726
# start_time: 1721201978
# start_time_asci: Wed Jul 17 03:39:38 2024
# end_time: 1721201984
# end_time_asci: Wed Jul 17 03:39:44 2024
# nprocs: 16

... but if I build the same version of Darshan (darshan-util@main) in Spack, then I get the error that you are reporting:

carns-x1-11g ~/w/d/darshan-issue-995-mount-table> darshan-parser darsh-s8000000000-np16.darshan |head
Error: poorly formatted mount table in darshan log file.

I haven't been able to work out what the key difference is, though. I've tried to put the spack-built libzlib-ng and libdarshan-util in the library path for my manual build, but they seem fine there. There is something subtle different about the environment that is causing a problem.

@bellenlau are you building in Spack as well? If so can you share what ldd of your darshan-parser looks like? Mine looks like this:

carns-x1-11g ~/w/d/darshan-issue-995-mount-table> which darshan-parser
/home/carns/working/src/spack/var/spack/environments/mochi-bedrock/.spack-env/view/bin/darshan-parser
carns-x1-11g ~/w/d/darshan-issue-995-mount-table> ldd /home/carns/working/src/spack/var/spack/environments/mochi-bedrock/.spack-env/view/bin/darshan-parser
    linux-vdso.so.1 (0x00007ffe065de000)
    libdarshan-util.so.0 => /home/carns/working/src/spack/opt/spack/linux-ubuntu24.04-skylake/gcc-13.2.0/darshan-util-main-dzh4l6vagqjy7hlfgymov7vctkbjozja/lib/libdarshan-util.so.0 (0x00007936e42a5000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007936e4000000)
    libz.so.1 => /home/carns/working/src/spack/opt/spack/linux-ubuntu24.04-skylake/gcc-13.2.0/zlib-ng-2.2.1-zqgjz4orlscjsrb2zc43q4y2oi4myktz/lib/libz.so.1 (0x00007936e4269000)
    /lib64/ld-linux-x86-64.so.2 (0x00007936e42d3000)
carns commented 3 months ago

I'm trying to catch the spack-built darshan-util in a debugger to see what the table looks like that it is trying to parse in darshan_log_get_mounts(), but I'm having a hard time getting a build out of it that includes debugging symbols without optimizations. I'm trying this but its obviously still getting -O2 so I can't figure out much:

spack add darshan-util@main cppflags="-g3 -O0" cflags="-g3 -O0"
bellenlau commented 3 months ago

Hello,

thank you for the insight. I checked with a manually installed darshan and indeed the issue does not arise. Here is the output of ldd

[lbellen1@login02 misterio]$ ldd /leonardo/pub/userinternal/lbellen1/spack-0.21.0-05/install/linux-rhel8-icelake/gcc-8.5.0/darshan-util-3.4.5-vibaclfthya6ury7q6omqgdinzdx35o4/bin/darshan-parser
        linux-vdso.so.1 (0x00007fffb8dcc000)
        libdarshan-util.so.0 => /leonardo/pub/userinternal/lbellen1/spack-0.21.0-05/install/linux-rhel8-icelake/gcc-8.5.0/darshan-util-3.4.5-vibaclfthya6ury7q6omqgdinzdx35o4/lib/libdarshan-util.so.0 (0x00007f1283813000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f1283602000)
        libz.so.1 => /leonardo/prod/spack/05/install/0.21/linux-rhel8-icelake/gcc-8.5.0/zlib-ng-2.1.4-6htiapkoa6fx2medhyabzo575skozuir/lib/libz.so.1 (0x00007f12833dc000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f1283006000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1283a35000)
bellenlau commented 3 months ago

Hello,

I tried building with spack but by forcing ^zlib instead of ^zlib-ng and the error does not arise; the issue seems to be triggered by zlib-ng. If you prefer I can report in on the spack github.

Thank you,

Laura

carns commented 3 months ago

Oh, interesting. Thanks for finding that!

Let's debug just a little further to see what exactly the difference is before we report a problem with zlib-ng. It's possible that darshan-utils has a subtle error (like with a string terminator or something) that just happens to work Ok with the original zlib. Or if not we can maybe make a simpler reproducer.

carns commented 3 months ago

FWIW here is a bug report on not being able to build spack packages with debug symbols: https://github.com/spack/spack/issues/45507

In this case I'd like to build darshan-util with debug symbols to confirm what the buffer looks in the Spack build, but either it's broken or I'm missing something.

carns commented 3 months ago

@bellenlau , was the runtime darshan library also linked against zlib-ng when this log was written?

We figured out what was making it difficult to debug in spack so it's easier to look at the problem now, but the trigger isn't clear. The decompressor is getting a few extra characters out of the Darshan log when it reads the mount table. There are things that could be done on the log writing side to make the data itself resilient to this kind of problem, but I'd prefer to understand the root cause rather than inserting a workaround.

carns commented 3 months ago

Actually it fails on some logs in our example repository too (but not all of them), but they were definitely written with the conventional zlib implementation. This one triggers the problem for example:

https://github.com/darshan-hpc/darshan-logs/blob/main/darshan_logs/release_logs/mpi-io-test-x86_64-3.4.3.darshan

But this one does not:

https://github.com/darshan-hpc/darshan-logs/blob/main/darshan_logs/release_logs/mpi-io-test-x86_64-3.4.4.darshan

carns commented 3 months ago

Ok, I isolated the zlib-ng behavior that is causing a problem. Sometimes it modifies bytes in the decompression buffer beyond the number of bytes it indicated that it produced. In the Darshan mount table parsing specifically this leads to corrupted strings in the final entry.

I've got a PR opened that works around the problem (by explicitly zeroing unused bytes after decompression) but we need to make sure it passes CI.

I suspect that we can make a simpler reproducer to report the bug at zlib-ng; this code path in Darshan is a little complex.