StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

HTR crash on multiple nodes #1573

Closed: cmelone closed this 9 months ago

cmelone commented 11 months ago

Latest control_replication. 4 ranks, 1 rank per node, with GPUs. The crash is non-deterministic and only occurs with specific test cases of the solver.

I think this is the same crash as #1415. Feel free to move this back there if you'd prefer.

crash:

legion/runtime/realm/transfer/transfer.cc:161: bool Realm::TransferIteratorBase<N, T>::done() [with int N = 1; T = long long int]: Assertion `inst_impl->metadata.is_valid()' failed.

backtrace:

(gdb) bt
#0  0x00007f88fc82f9fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f88fc82f894 in sleep () from /lib64/libc.so.6
#2  0x00007f88fe13f162 in Realm::realm_freeze (signal=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/runtime_impl.cc:200
#3  <signal handler called>
#4  0x00007f88fc7a0387 in raise () from /lib64/libc.so.6
#5  0x00007f88fc7a1a78 in abort () from /lib64/libc.so.6
#6  0x00007f88fc7991a6 in __assert_fail_base () from /lib64/libc.so.6
#7  0x00007f88fc799252 in __assert_fail () from /lib64/libc.so.6
#8  0x00007f88fe17878c in Realm::TransferIteratorBase<1, long long>::done (this=0x7f88b83c6bc0) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/metadata.h:43
#9  0x00007f88fe18df0a in Realm::TransferIteratorBase<1, long long>::get_addresses (this=0x7f88b83c6bc0, addrlist=..., nonaffine=@0x7f88797ff398: 0x0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/transfer.cc:523
#10 0x00007f88fe1f833c in Realm::XferDes::get_addresses (this=0x7f88b83d4200, min_xfer_size=8, rseqcache=<optimized out>, in_nonaffine=@0x7f88797ff390: 0x0, out_nonaffine=@0x7f88797ff398: 0x0)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.cc:1509
#11 0x00007f88fe1f8471 in Realm::XferDes::get_addresses (this=<optimized out>, min_xfer_size=<optimized out>, rseqcache=<optimized out>)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.cc:1430
#12 0x00007f88fe1f9a0a in Realm::MemreduceXferDes::progress_xd(Realm::MemreduceChannel*, Realm::TimeLimit) ()
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.cc:3472
#13 0x00007f88fe206521 in Realm::XDQueue<Realm::MemreduceChannel, Realm::MemreduceXferDes>::do_work (this=0x3d3dd38, work_until=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/transfer/channel.inl:53
#14 0x00007f88fe15ca20 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/timers.inl:288
#15 0x00007f88fe233087 in wait_for_work (old_work_counter=<optimized out>, this=0x3c1c800) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1291
#16 Realm::ThreadedTaskScheduler::wait_for_work (this=0x3c1c800, old_work_counter=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1275
#17 0x00007f88fe2387d3 in Realm::ThreadedTaskScheduler::scheduler_loop() () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1260
#18 0x00007f88fe21f4cf in Realm::UserThread::uthread_entry() () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/threads.cc:1355
#19 0x00007f88fc7b2190 in ?? () from /lib64/libc.so.6
#20 0x0000000000000000 in ?? ()

@elliottslaughter, please add to #1032, thanks!

lightsighter commented 11 months ago

Please report the name of the instance that is missing its metadata (ask someone from the Realm team how to do this) and the logs with -level inst=1 from a failed run.

apryakhin commented 11 months ago

@lightsighter I can help with debugging this

lightsighter commented 11 months ago

I suspect it is likely a Legion issue if we're deleting an instance before it is done being used, but in order to understand that I need to know the name of the instance and what it was used for (region/future/etc.), which requires logging information from a failed run.

apryakhin commented 11 months ago

@cmelone I think we need to run this with the debug option -level xplan=1,inst=1. xplan always prints a transfer graph, which includes all related instances; from that we should be able to derive which instance we are talking about here. @lightsighter Forgive me - what are you referring to by instance name? RegionInstance provides a standard serdez when logged to an output stream, so I am hoping that would be enough? e.g. (these are not our instances):

  3 [0 - 7f1714317c80]    0.626858 {1}{xplan}: created: plan=0x7ef76e6aa120 srcs[0]=field(103, inst=400000000080000f, size=4)
  4 [0 - 7f1714317c80]    0.626860 {1}{xplan}: created: plan=0x7ef76e6aa120 dsts[0]=field(103, inst=4000000000800030, size=4)
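
A minimal sketch of what that looks like from C++, assuming Realm's stream operator for RegionInstance; the helper name and logger below are placeholders, not anything from the run above:

#include "realm.h"

// Hypothetical helper: streaming a RegionInstance into a Realm logger prints the
// instance ID, i.e. the inst=... value that shows up in the xplan/inst log lines.
void log_instance(Realm::Logger &log, Realm::RegionInstance inst)
{
  log.info() << "instance " << inst;
}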

streichler commented 11 months ago

@apryakhin I think the idea was to replace this line: https://github.com/StanfordLegion/legion/blob/stable/runtime/realm/transfer/transfer.cc#L161 with something more directly helpful, e.g.:

if(!inst_impl->metadata.is_valid())
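  // log_dma is Realm's existing "dma" logger (the {dma} category in the output below);
  // unlike the bare assert, fatal() prints the instance ID before terminating the run.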
  log_dma.fatal() << "missing metadata for instance " << inst_impl->me << " during copy - possible early deletion?";

@cmelone I haven't tried compiling the above, but hopefully it'll work. Can you try adding it to your build and running again with the increased logging that was requested?

(edited the code from the first version - I forgot that fatal logging messages automatically terminate the application now)

cmelone commented 11 months ago

will do -- thanks

cmelone commented 11 months ago

[0 - 7f891972dc80]  121.681378 {6}{dma}: missing metadata for instance 4000000000400000 during copy - possible early deletion?
Legion process received signal 11: Segmentation fault
Process 119577 on node largemem-3-2 is frozen!
Launched: 1 Total time: 165.317 seconds
[0 - 7f894336bc80]  166.960797 {6}{realm}: 1 pending operations during shutdown!
Legion process received signal 6: Aborted
Process 119577 on node largemem-3-2 is frozen!

Logs available at /home/cmelone/2x2x4 on sapling

cmelone commented 11 months ago

Also got this backtrace from a generic segfault:

#0  0x00007fbca02539fd in nanosleep () from /lib64/libc.so.6
#1  0x00007fbca0253894 in sleep () from /lib64/libc.so.6
#2  0x00007fbca1b63162 in Realm::realm_freeze (signal=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/runtime_impl.cc:200
#3  <signal handler called>
#4  0x00007fbca020e4a9 in malloc_consolidate () from /lib64/libc.so.6
#5  0x00007fbca0210135 in _int_malloc () from /lib64/libc.so.6
#6  0x00007fbca021378c in malloc () from /lib64/libc.so.6
#7  0x00007fbca180abfe in Serializer (base_bytes=4096, this=0x7fbc0784a840) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/garbage_collection.cc:1028
#8  Legion::Internal::DistributedCollectable::check_for_downgrade(unsigned int) () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/garbage_collection.cc:1028
#9  0x00007fbca180dc4f in Legion::Internal::DistributedCollectable::process_downgrade_request(unsigned int, Legion::Internal::DistributedCollectable::State) ()
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/garbage_collection.cc:1099
#10 0x00007fbca180dd51 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=runtime@entry=0x3bef440, derez=..., source=source@entry=3)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/garbage_collection.cc:1077
#11 0x00007fbca1767695 in Legion::Internal::Runtime::handle_did_downgrade_request (this=this@entry=0x3bef440, derez=..., source=source@entry=3)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/runtime.cc:24495
#12 0x00007fbca17ce04b in Legion::Internal::VirtualChannel::handle_messages (this=this@entry=0x7fbc6000d6e0, num_messages=num_messages@entry=1, runtime=runtime@entry=0x3bef440,
    remote_address_space=remote_address_space@entry=3, args=0x7fbc5e4b3660 "c\005\002", args@entry=0x7fbc5e4b364c "E", arglen=<optimized out>, arglen@entry=36)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/runtime.cc:12307
#13 0x00007fbca17cf1a9 in Legion::Internal::VirtualChannel::process_message(void const*, unsigned long, Legion::Internal::Runtime*, unsigned int) ()
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/runtime.cc:11783
#14 0x00007fbca17cf568 in Legion::Internal::MessageManager::receive_message (this=<optimized out>, args=args@entry=0x7fbc5e4b3640, arglen=44, arglen@entry=48)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/runtime.cc:13501
#15 0x00007fbca17cf599 in Legion::Internal::Runtime::process_message_task (this=this@entry=0x3bef440, args=args@entry=0x7fbc5e4b363c, arglen=48, arglen@entry=52)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/runtime.cc:26246
#16 0x00007fbca17cf640 in Legion::Internal::Runtime::legion_runtime_task (args=0x7fbc5e4b3630, arglen=52, userdata=<optimized out>, userlen=<optimized out>, p=...)
    at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/legion/runtime.cc:32012
#17 0x00007fbca1d6dc41 in Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/bytearray.inl:150
#18 0x00007fbca1c5ddc3 in Realm::Task::execute_on_processor(Realm::Processor) () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/runtime_impl.h:510
#19 0x00007fbca1c5de56 in Realm::UserThreadTaskScheduler::execute_task (this=<optimized out>, task=<optimized out>) at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1687
#20 0x00007fbca1c5c4db in Realm::ThreadedTaskScheduler::scheduler_loop() () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/tasks.cc:1158
#21 0x00007fbca1c434cf in Realm::UserThread::uthread_entry() () at /home/hpcc/gitlabci/psaap-ci/artifacts/5358942963/legion/runtime/realm/threads.cc:1355
#22 0x00007fbca01d6190 in ?? () from /lib64/libc.so.6
#23 0x0000000000000000 in ?? ()

lightsighter commented 11 months ago

Since this only requires two processes, please make a reproducer on sapling.

> Also got this backtrace from a generic segfault:

That is unrelated. Something corrupted the heap; it could be the runtime, but it could also be the application. Whatever went wrong, it happened way before this backtrace, so it is not interesting.

mariodirenzo commented 11 months ago

I got the following segmentation fault while trying to reproduce this issue on sapling:

#0  0x00007f6ac792323f in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f6448044920, rem=rem@entry=0x7f6448044920) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007f6ac7928ec7 in __GI___nanosleep (requested_time=requested_time@entry=0x7f6448044920, remaining=remaining@entry=0x7f6448044920) at nanosleep.c:27
#2  0x00007f6ac7928dfe in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f6acaf0db95 in Realm::realm_freeze (signal=11) at /home/mariodr/legion3/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  tcache_get (tc_idx=<optimized out>) at malloc.c:2937
#6  __GI___libc_malloc (bytes=4) at malloc.c:3051
#7  0x00007f6ac7c4eb29 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f6aca17f732 in __gnu_cxx::new_allocator<int>::allocate (this=0x7f65617e86d0, __n=1) at /usr/include/c++/9/ext/new_allocator.h:114
#9  0x00007f6aca1729b2 in std::allocator_traits<std::allocator<int> >::allocate (__a=..., __n=1) at /usr/include/c++/9/bits/alloc_traits.h:443
#10 0x00007f6aca162d9c in std::_Vector_base<int, std::allocator<int> >::_M_allocate (this=0x7f65617e86d0, __n=1) at /usr/include/c++/9/bits/stl_vector.h:343
#11 0x00007f6aca147e85 in std::_Vector_base<int, std::allocator<int> >::_M_create_storage (this=0x7f65617e86d0, __n=1) at /usr/include/c++/9/bits/stl_vector.h:358
#12 0x00007f6aca11e7c7 in std::_Vector_base<int, std::allocator<int> >::_Vector_base (this=0x7f65617e86d0, __n=1, __a=...) at /usr/include/c++/9/bits/stl_vector.h:302
#13 0x00007f6aca0f5d1f in std::vector<int, std::allocator<int> >::vector (this=0x7f65617e86d0, __n=1, __a=...) at /usr/include/c++/9/bits/stl_vector.h:508
#14 0x00007f6acaf6775f in Realm::TransferOperation::create_xds (this=0x7f6488240f50) at /home/mariodr/legion3/runtime/realm/transfer/transfer.cc:5024
#15 0x00007f6acaf666f2 in Realm::TransferOperation::allocate_ibs (this=0x7f6488240f50) at /home/mariodr/legion3/runtime/realm/transfer/transfer.cc:4888
#16 0x00007f6acaf69658 in Realm::TransferOperation::DeferredStart::event_triggered (this=0x7f64882410b0, poisoned=false, work_until=...) at /home/mariodr/legion3/runtime/realm/transfer/transfer.cc:5349
#17 0x00007f6acb2606b2 in Realm::EventTriggerNotifier::trigger_event_waiters (this=0x559ef2c294a8, to_trigger=..., poisoned=false, trigger_until=...) at /home/mariodr/legion3/runtime/realm/event_impl.cc:575
#18 0x00007f6acb266c02 in Realm::GenEventImpl::trigger (this=0x7f65f02268b0, gen_triggered=1, trigger_node=1, poisoned=false, work_until=...) at /home/mariodr/legion3/runtime/realm/event_impl.cc:1941
#19 0x00007f6acb08e484 in Realm::Operation::trigger_finish_event (this=0x7f64bc06c120, poisoned=false) at /home/mariodr/legion3/runtime/realm/operation.cc:300
#20 0x00007f6acb08e066 in Realm::Operation::mark_completed (this=0x7f64bc06c120) at /home/mariodr/legion3/runtime/realm/operation.cc:201
#21 0x00007f6acb096213 in Realm::Task::mark_completed (this=0x7f64bc06c120) at /home/mariodr/legion3/runtime/realm/tasks.cc:174
#22 0x00007f6acaf717f2 in Realm::Operation::work_item_finished (this=0x7f64bc06c120, item=0x7f6564045170, successful=true) at /home/mariodr/legion3/runtime/realm/operation.inl:69
#23 0x00007f6acaf718a7 in Realm::Operation::AsyncWorkItem::mark_finished (this=0x7f6564045170, successful=true) at /home/mariodr/legion3/runtime/realm/operation.inl:119
#24 0x00007f6acb39c6b1 in Realm::Cuda::GPUStream::reap_events (this=0x559ef3825d60, work_until=...) at /home/mariodr/legion3/runtime/realm/cuda/cuda_module.cc:405
#25 0x00007f6acb3a45ef in Realm::Cuda::GPUWorker::do_work (this=0x559ef3062270, work_until=...) at /home/mariodr/legion3/runtime/realm/cuda/cuda_module.cc:2339
#26 0x00007f6acaf539be in Realm::BackgroundWorkManager::Worker::do_work (this=0x559ef53be928, max_time_in_ns=100, interrupt_flag=0x559ef53be988) at /home/mariodr/legion3/runtime/realm/bgwork.cc:599
#27 0x00007f6acb09a0e1 in Realm::ThreadedTaskScheduler::wait_for_work (this=0x559ef53be710, old_work_counter=118552) at /home/mariodr/legion3/runtime/realm/tasks.cc:1291
#28 0x00007f6acb09b138 in Realm::KernelThreadTaskScheduler::wait_for_work (this=0x559ef53be710, old_work_counter=118552) at /home/mariodr/legion3/runtime/realm/tasks.cc:1528
#29 0x00007f6acb099f01 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x559ef53be710) at /home/mariodr/legion3/runtime/realm/tasks.cc:1260
#30 0x00007f6acb099ff5 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x559ef53be710) at /home/mariodr/legion3/runtime/realm/tasks.cc:1272
#31 0x00007f6acb0a269c in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x559ef53be710) at /home/mariodr/legion3/runtime/realm/threads.inl:97
#32 0x00007f6acb06e194 in Realm::KernelThread::pthread_entry (data=0x7f6448001bb0) at /home/mariodr/legion3/runtime/realm/threads.cc:781
#33 0x00007f6ac7821609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#34 0x00007f6ac7965133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If you want to take a look, the process is hanging at

Legion process received signal 11: Segmentation fault
Process 3334814 on node g0002.stanford.edu is frozen!

lightsighter commented 11 months ago

That is also a memory corruption that occurred a long time ago and is either caused by the same underlying issue as this one or by something unrelated.

mariodirenzo commented 11 months ago

The memory corruption seems to go away when reverting to commit 74d60549196cd68e337925484d921acc01002f56. Do you have another commit that might be a good candidate for bisecting?

I can still reproduce

/home/mariodr/legion3/runtime/realm/transfer/transfer.cc:161: bool Realm::TransferIteratorBase<N, T>::done() [with int N = 1; T = long long int]: Assertion `inst_impl->metadata.is_valid()' failed.
/home/mariodr/legion3/runtime/realm/utils.h:160: T Realm::checked_cast(T2*) [with T = const Realm::InstanceLayout<1, long long int>*; T2 = Realm::InstanceLayoutGeneric]: Assertion `result != 0' failed.

on 74d60549196cd68e337925484d921acc01002f56. How do you want to proceed with debugging? (I'm using sapling and can provide instructions to reproduce the issue.)

lightsighter commented 11 months ago

There are no obvious commits between that commit and the current top of control_replication that would cause an issue like this. I suspect they are the same bug and something else just perturbed the timing so that the failure mode changed. If you want to bisect it, though, you can do that. Or you can just make the reproducer with the commit that you know fails.

mariodirenzo commented 11 months ago

At this point, let's debug

/home/mariodr/legion3/runtime/realm/transfer/transfer.cc:161: bool Realm::TransferIteratorBase<N, T>::done() [with int N = 1; T = long long int]: Assertion `inst_impl->metadata.is_valid()' failed.

using 74d60549196cd68e337925484d921acc01002f56 and see what happens next. I'll send the instructions to reproduce the issue on Slack.

mariodirenzo commented 11 months ago

@elliottslaughter, can you please add this issue to #1032?

lightsighter commented 9 months ago

@cmelone please try the futurevals branch of Legion and see if the issue is resolved and then report back the result.

cmelone commented 9 months ago

@lightsighter has this been merged? (Sorry if I'm late, I was without my laptop.)

I can't find the branch. The original crash is still coming up on the latest commit of CR.

lightsighter commented 9 months ago

Which commit are you on?

lightsighter commented 9 months ago

Also, are you running with -lg:safe_mapper?

lightsighter commented 9 months ago

Pull and try again and be sure you are on this commit: https://gitlab.com/StanfordLegion/legion/-/commit/fccc49034e05de442609088033b7139cef00ab6c

If it still reproduces after that then make a reproducer on sapling.

cmelone commented 9 months ago

Not seeing this crash anymore, thank you!