StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

profiler crash on 8-node S3D profile #1675

Closed rohany closed 5 months ago

rohany commented 5 months ago

I'm seeing the following profiler crash on an 8-node run of S3D. For some reason, valid profiles are generated at all other node counts (1->16).

(nersc-python) rohany@perlmutter:login23:/pscratch/sd/r/rohany/s3d_perlmutter> RUST_BACKTRACE=full legion_prof ./sweeptest/128x64x64/auto-trace/pwave_x_8_hept/run/prof_0.gz -o test
Reading log file "./sweeptest/128x64x64/auto-trace/pwave_x_8_hept/run/prof_0.gz"...
thread 'main' panicked at /pscratch/sd/r/rohany/legion_s3d/legion/tools/legion_prof_rs/src/spy/serialize.rs:587:37:
called `Result::unwrap()` on an `Err` value: Error(Error { input: "", code: CrLf })
stack backtrace:
   0:     0x55d2adfd8496 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h9cca0343d66d16a8
   1:     0x55d2ae000370 - core::fmt::write::h4311bce0ee536615
   2:     0x55d2adfd5c6f - std::io::Write::write_fmt::h0685c51539d0a0cd
   3:     0x55d2adfd8274 - std::sys_common::backtrace::print::h2fb8f70628a241ed
   4:     0x55d2adfd9af7 - std::panicking::default_hook::{{closure}}::h05093fe2e3ef454d
   5:     0x55d2adfd9859 - std::panicking::default_hook::h5ac38aa38e0086d2
   6:     0x55d2adfd9f88 - std::panicking::rust_panic_with_hook::hed79743dc8b4b969
   7:     0x55d2adfd9e62 - std::panicking::begin_panic_handler::{{closure}}::ha437b5d58f431abf
   8:     0x55d2adfd8996 - std::sys_common::backtrace::__rust_end_short_backtrace::hd98e82d5b39ec859
   9:     0x55d2adfd9bb4 - rust_begin_unwind
  10:     0x55d2add87765 - core::panicking::panic_fmt::hc69c4d258fe11477
  11:     0x55d2add87c53 - core::result::unwrap_failed::hff299ec748d62aab
  12:     0x55d2addbb2ef - legion_prof::spy::serialize::deserialize::h56b3bab62280837c
  13:     0x55d2addcfe5b - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once::hcd60ee582f9cd211
  14:     0x55d2addb2597 - <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend::hadd5946032273160
  15:     0x55d2addccbad - rayon::iter::plumbing::Producer::fold_with::h2df3b665c549b02f
  16:     0x55d2add9e448 - rayon::iter::plumbing::bridge_producer_consumer::helper::h7229160a9354d8f6
  17:     0x55d2addb197e - rayon::iter::extend::<impl rayon::iter::ParallelExtend<T> for alloc::vec::Vec<T>>::par_extend::h6237cfef6dae8bc0
  18:     0x55d2addb1850 - rayon::iter::from_par_iter::<impl rayon::iter::FromParallelIterator<T> for alloc::vec::Vec<T>>::from_par_iter::h0f8c38eb4d0bd1b6
  19:     0x55d2addcd7f0 - rayon::result::<impl rayon::iter::FromParallelIterator<core::result::Result<T,E>> for core::result::Result<C,E>>::from_par_iter::h1454c8c6073271ed
  20:     0x55d2addd3d16 - legion_prof::main::h409a5eb431bab0b8
  21:     0x55d2addc59b3 - std::sys_common::backtrace::__rust_begin_short_backtrace::h7836657c01cdd8be
  22:     0x55d2addc59cd - std::rt::lang_start::{{closure}}::h8e84562a5ce98348
  23:     0x55d2adfcec81 - std::rt::lang_start_internal::hdaf8b62dc8f7de54
  24:     0x55d2adddddd5 - main
  25:     0x7f2063cc724d - __libc_start_main
  26:     0x55d2add87f2a - _start
  27:                0x0 - <unknown>

The profile in question is at rohany@sapling2.stanford.edu:~/broken-s3d-profile.gz. I am on legion branch automatic-tracing-non-idempotent-traces, which is branched off of main from a few days ago.

cc @elliottslaughter @eddy16112

elliottslaughter commented 5 months ago

$ ls -l ~rohany/broken-s3d-profile.gz
-rw-rw---- 1 rohany rohany 0 Apr  3 20:02 /home/rohany/broken-s3d-profile.gz

No permissions.

rohany commented 5 months ago

try now

elliottslaughter commented 5 months ago

The file seems to be empty?

$ wc -c ~rohany/broken-s3d-profile.gz 
0 /home/rohany/broken-s3d-profile.gz

rohany commented 5 months ago

sorry for the false alarm, i can't read my logs correctly ...
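
For anyone who lands here with the same panic: the error in the backtrace reports `input: ""`, which is consistent with the parser being handed a zero-byte file, as the `wc -c` output above shows. A cheap way to catch this before invoking the profiler is to check the file size first. The following is only a minimal sketch in Rust; `check_log_not_empty` is a hypothetical helper written for illustration and is not part of `legion_prof`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical pre-flight check (NOT part of legion_prof):
/// returns the file size in bytes, or an error if the file is empty
/// or its metadata cannot be read (missing file, no permissions, ...).
fn check_log_not_empty(path: &Path) -> io::Result<u64> {
    // `metadata` fails with a descriptive error if the file is missing
    // or the caller lacks permission to stat it.
    let len = fs::metadata(path)?.len();
    if len == 0 {
        // Report the empty file explicitly instead of letting a parser
        // fail later with a less obvious message.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("log file {} is empty (0 bytes)", path.display()),
        ));
    }
    Ok(len)
}
```

Calling `check_log_not_empty(Path::new("prof_0.gz"))` on each log before parsing would turn an empty input into a readable error message rather than an `unwrap` panic deep inside the deserializer.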