StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

abort() does not flush IO #1693

Open suranap opened 1 month ago

suranap commented 1 month ago

I ran into this because the error message shown here never appeared in my logs; instead, the debugger showed this line in the stack dump. It turns out abort() does not flush IO. exit() is a nicer way to end a process, and it will usually flush IO.

https://github.com/StanfordLegion/legion/blob/c61071541218747e35767317f6f89b83f374f264/runtime/legion/runtime.cc#L32128-L32130

elliottslaughter commented 1 month ago

If I recall correctly, we do not use exit() because it does not trap in a debugger, making it harder to stop at the exact point of failure.

We could use a wrapper around abort that flushes first.
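
A minimal sketch of that idea (my illustration, not code from the Legion tree): flush the standard streams, then abort, so pending output is written while SIGABRT still fires for the debugger and the core dump:

#include <cstdio>
#include <cstdlib>

// Flush pending stdout/stderr output, then raise SIGABRT as usual.
[[noreturn]] static void flushing_abort(void)
{
  std::fflush(stdout);
  std::fflush(stderr);
  std::abort();
}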

lightsighter commented 1 month ago

Are you referring to Realm logger output or stdout/stderr output when you say it is not being flushed?

suranap commented 1 month ago

@lightsighter The logs are stdout/stderr from srun --output filename. I get a message in the logs that the process has aborted, but no log message about why.

@elliottslaughter gdb has catch syscall exit exit_group; not sure if that's what you're looking for. In addition, you can register a stack-trace handler with atexit to print stacks when exiting with a failure status (a sketch follows below). Sadly, on_exit is GNU-only, so you'd need a global variable to store the exit status code.
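
To illustrate the atexit idea (a glibc-specific sketch; the global and the handler name are mine, not anything in Legion):

#include <cstdlib>
#include <execinfo.h>   // backtrace, backtrace_symbols_fd (GNU/glibc)
#include <unistd.h>     // STDERR_FILENO

static int g_exit_status = 0;  // set before calling exit(), since on_exit is GNU-only

static void print_trace_on_failure(void)
{
  if (g_exit_status == 0)
    return;  // clean exit, nothing to report
  void *frames[64];
  int depth = backtrace(frames, 64);
  // Writes symbolized frames straight to stderr without calling malloc.
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}

// Registration, e.g. early in main():
//   std::atexit(print_trace_on_failure);
// Failure paths then do: g_exit_status = 1; std::exit(1);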

lightsighter commented 1 month ago

@lightsighter The logs are stdout/stderr from srun --output filename. I get a message in the logs that the process has aborted, but no log message about why.

What is the exit code? FWIW, Realm error messages will be flushed on abort even if stdout/stderr are not. All Legion and Realm error messages are reported that way and will show up in the logging files you create with the -logfile option.

suranap commented 1 month ago

The exit code indicates signal 6 (SIGABRT).

Also, how big can a future be? FlexFlow creates 4k futures; I increased that to 16k for reasons and hit this error.

Below is more context. I set it to freeze on error and print a stack trace. FYI: this is a year-old version of Legion.

Last few lines of my log from srun --output filename ...:

Input_2000597 num_replicas(1) volume(3145728) dims(6144, 512, )
Index load entire dataset: ptr input_head_ 0x7fe9ef8cd010, idx 0, input_ 0x7fe9ef8cd010 140642723090448 30720000, input 0x7ff028600000 140669446258688 30720000
2 4 1694 1693 4 4 1693 3653 1698 2324 5801 2030 2034 3116 1738 2334 2489 4 18 7518 17145 1703 7102 9683 24281 1714 1777 14291 2440 1761 2609 1818
numpy array: (140642600206352, False), <cdata 'float *' 0x7fe9e839c010>, 0x7fe9e839c010
Index load entire dataset: ptr input_head_ 0x7fe9e839c010, idx 0, input_ 0x7fe9e839c010 140642600206352 30720000, input 0x7ff02fb30000 140669569138688 30720000
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
numpy array: (140662590826368, False), <cdata 'float *' 0x7fee8fc26780>, 0x7fee8fc26780
Index load entire dataset: ptr input_head_ 0x7fee8fc26780, idx 0, input_ 0x7fee8fc26780 140662590826368 512, input 0x7ff037060000 140669692018688 512
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
numpy array: (140662591049088, False), <cdata 'float *' 0x7fee8fc5cd80>, 0x7fee8fc5cd80
Index load entire dataset: ptr input_head_ 0x7fee8fc5cd80, idx 0, input_ 0x7fee8fc5cd80 140662591049088 512, input 0x7ff037060800 140669692020736 512
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Numpy array: (140639692714000, False), <cdata 'float *' 0x7fe93aecf010>, 0x7fe93aecf010
[0 - 7fee98bf5380]  472.641366 {3}{flexflow_c}: [Tensor] get dims [0, 6144, 512, 1]
Index load entire dataset: ptr input_head_ 0x7fe93aecf010, idx 0, input_ 0x7fe93aecf010 140639692714000 30720000, input 0x7ff037061000 140669692022784 30720000
-100 4951 -100 -100 4697 1698 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 8106 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
Legion process received signal 6: Aborted
Process 32864 on node frontier05995 is frozen!
+ gdb -p 32864 --batch -ex 'thread apply all bt'
Missing separate debuginfo for /opt/rocm-5.6.0/lib/libamdhip64.so.5
Try: zypper install -C "debuginfo(build-id)=3e24136391633ee5e26e3e3011e419adc8b89023"
Missing separate debuginfo for /opt/rocm-5.6.0/lib/libamd_comgr.so.2
Try: zypper install -C "debuginfo(build-id)=5980d7b00e9b762dab58e806e819239acabc10ec"
Missing separate debuginfo for /opt/rocm-5.6.0/lib/libhsa-runtime64.so.1
Try: zypper install -C "debuginfo(build-id)=fcf65b97b7007f6218396ee700939aa332f68317"
Missing separate debuginfo for /opt/rocm-5.6.0/lib/librocm_smi64.so.5
Try: zypper install -C "debuginfo(build-id)=0a634656f0f269d2ca1cf4cbfe32ed09ecf91eef"

Here's the relevant stack with nanosleep:

Thread 6 (Thread 0x7fee9cdf5380 (LWP 100831) "legion_python"):
#0  0x00007fffe4fb6121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fffe4fbbe43 in nanosleep () from /lib64/libc.so.6
#2  0x00007fffe4fbbd5a in sleep () from /lib64/libc.so.6
#3  0x00007fffe676b06a in Realm::realm_freeze (signal=6) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/realm/runtime_impl.cc:187
#4  <signal handler called>
#5  0x00007fffe4f24d2b in raise () from /lib64/libc.so.6
#6  0x00007fffe4f263e5 in abort () from /lib64/libc.so.6
#7  0x00007fffea1249ee in Legion::Internal::Runtime::report_error_message (id=576, file_name=<optimized out>, line=10616, message=<optimized out>) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/runtime.cc:31799
#8  Legion::Internal::MemoryManager::create_future_instance (this=0x7fee8fd6c780, op=0x7fee8408e870, creator_uid=161, ready_event=..., size=67140, eager=true) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/runtime.cc:10609
#9  0x00007fffe9d1795f in Legion::Internal::TaskContext::copy_to_future_inst (this=this@entry=0x7fee84074680, value=value@entry=0x7fee9cde1980, size=size@entry=67140, done=...) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/legion_context.cc:2331
#10 0x00007fffe9d1782b in Legion::Internal::TaskContext::from_value (this=0x7fee84074680, value=0x7fee9cde1980, size=67140, owned=false, provenance=0x0) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/legion_context.cc:182
#11 0x00007fffe9d51ad8 in Legion::Internal::ReplicateContext::from_value (this=0x7fee9cdde070, value=0x0, size=67140, owned=false, provenance=0x7fee9cdde070) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/legion_context.cc:12249
#12 0x00007fffe9e1463f in Legion::Future::from_untyped_pointer (value=0x7fee9cde1980, value_size=67140, owned=false, prov=0x0) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/legion.cc:2504
#13 0x00007fffea11ca63 in Legion::Internal::ArgumentMapImpl::set_point (this=0x7fed5cfcc980, point=..., arg=..., replace=true) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/runtime.cc:253
#14 0x00007fee59dd84d1 in SingleDataLoader::next_batch_xd_launcher<3> (this=0x7fed5d548a90, ff=..., task_id=120) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/src/dataloader/dataloader.cc:250
#15 0x00007fee59dd614e in SingleDataLoader::SingleDataLoader (this=0x7fed5d548a90, ff=..., input=<optimized out>, full_input_ptr=0x7fea14acf010, num_samples_=<optimized out>, datatype_=<optimized out>) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/src/dataloader/dataloader.cc:142
#16 0x00007fee59dd2583 in flexflow_single_dataloader_create2 (ffmodel_=..., input_=..., full_input_ptr=0x7fea14acf010, num_samples=60000, data_type=DT_INT32) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/src/c/flexflow_c.cc:1698
elliottslaughter commented 1 month ago

If I understand correctly, I think @lightsighter is suggesting that you use -logfile to direct log output to a file instead of the --output flag for srun (and that the messages will be flushed if you do).

For what it's worth, srun does all sorts of strange things with output and I've gotten weird results even when I've run srun ... stdbuf -o0 ... to make sure stdout is unbuffered.
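
One more workaround worth noting (an assumption on my part, not something Legion does today): the process can disable stdout buffering itself, which avoids depending on stdbuf surviving srun's output plumbing:

#include <cstdio>

int main()
{
  // Unbuffered stdout: every printf reaches srun's --output file immediately.
  std::setvbuf(stdout, nullptr, _IONBF, 0);
  std::printf("flushed as soon as it is written\n");
  return 0;
}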

lightsighter commented 1 month ago

#8 Legion::Internal::MemoryManager::create_future_instance (this=0x7fee8fd6c780, op=0x7fee8408e870, creator_uid=161, ready_event=..., size=67140, eager=true) at /ccs/home/suranap/FRONTIER/projects/Bert_big_merge/deps/legion/runtime/legion/runtime.cc:10609

The error messages in that function haven't changed much (if at all) in the past year. The problem isn't that you've made the future "too big" (Legion lets you make futures as large as you want with the right settings, and the error messages if you get that wrong won't come from this function). Instead, you're actually out of memory, in this case in the eager memory pool, which you'll need to enlarge with the -lg:eager_alloc_percentage flag. (In the next month or so you won't have to do this, but for now you do.)
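
For example (a hypothetical invocation; the script name and the value 50 are placeholders, pick a percentage that fits your workload):

legion_python bert.py -lg:eager_alloc_percentage 50 -logfile run_%.log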

opensdh commented 1 month ago

You can just use std::fflush(nullptr); to get the effect that exit has on streams without changing anything else (abort will still produce a core dump).

elliottslaughter commented 1 month ago

If stream is a null pointer, all open output streams are flushed, including the ones manipulated within library packages or otherwise not directly accessible to the program.

https://en.cppreference.com/w/cpp/io/c/fflush

@lightsighter @eddy16112 Seems worth using anywhere we call abort()?
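
A sketch of what that could look like (my combination of the two suggestions above, not an existing Legion function): unlike flushing stdout/stderr individually, std::fflush(nullptr) also reaches streams opened inside libraries, and abort() still raises SIGABRT and dumps core:

#include <cstdio>
#include <cstdlib>

[[noreturn]] static void abort_with_flush(void)
{
  std::fflush(nullptr);  // flush every open output stream, library-internal ones included
  std::abort();          // still catchable in gdb, still produces a core dump
}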

suranap commented 3 weeks ago

I think @lightsighter is suggesting that you use -logfile to direct log output to a file instead of the --output flag for srun (and that the messages will be flushed if you do).

I'm forced to run 8 Legion instances per node on Frontier (because MIOpen crawls if I run 1 instance per node). The docs say:

The -logfile flag can be used to specify a filename for the logs. A % character in the name (if any) will be replaced by the node number (resulting in one log file per node).

I need the logfile name to expand to the node number AND the process ID.

elliottslaughter commented 3 weeks ago

When the documentation says "node" it really means rank. You will get one log file per rank.
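
So with 8 ranks per node, a flag like -logfile ff_%.log (a hypothetical name) yields ff_0.log through ff_7.log on the first node, ff_8.log through ff_15.log on the second, and so on: one file per process, which is what you need.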