StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: Assertion `size <= ib_seg_size' failed #1769

Open syamajala opened 4 weeks ago

syamajala commented 4 weeks ago

I am seeing the following assertion when running cunumeric with ucx:

python: /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/_deps/legion-src/runtime/realm/ucx/ucp_internal.cc:1827: void* Realm::UCP::UCPInternal::pbuf_get(Realm::UCP::UCPWorker*, size_t): Assertion `size <= ib_seg_size' failed.

Here is a stack trace:

#0  0x00007f28e975c658 in nanosleep () from /lib64/libc.so.6                                                          
#1  0x00007f28e975c55e in sleep () from /lib64/libc.so.6
#2  0x00007f2833182f99 in Realm::realm_freeze (signal=6) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/proc_impl.h:206
#3  <signal handler called>
#4  0x00007f28e96b0a9f in raise () from /lib64/libc.so.6
#5  0x00007f28e9683e05 in abort () from /lib64/libc.so.6
#6  0x00007f28e9683cd9 in __assert_fail_base.cold.0 () from /lib64/libc.so.6                                          
#7  0x00007f28e96a93f6 in __assert_fail () from /lib64/libc.so.6                                                      
#8  0x00007f28332821fe in Realm::UCP::UCPInternal::pbuf_get (this=0x55998ec48db0, worker=0x55998e4cac00, size=26496) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/ucp_context.h:1827                                      
#9  0x00007f283328285f in Realm::UCP::UCPMessageImpl::UCPMessageImpl (this=0x7ee66cc0d530, _internal=0x55998ec48db0, _target=0, _msgid=54, _header_size=32, _max_payload_size=26496, _src_payload_addr=0x0, _src_payload_lines=0, _src_payload_line_stride=0, _src_segment=0x0, _dest_payload_addr=0x0, _storage_size=256) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/ucp_context.h:1927
#10 0x00007f283327743a in Realm::UCPModule::create_active_message_impl (this=0x55998ec48d00, target=0, msgid=54, header_size=32, max_payload_size=26496, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7ee66cc0d530, storage_size=256) at /sdf/group/lcls/ds/tools/conda_envs/cunumeric-nightly/lib/gcc/x86_64-conda-linux-gnu/11.4.0/include/c++/network.h:259
#11 0x00007f2832c8fbb0 in Realm::Network::create_active_message_impl (target=0, msgid=54, header_size=32, max_payload_size=26496, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7ee66cc0d530, storage_size=256) at /sdf/group/lcls/ds/tools/conda_envs/cunumeric-nightly/lib/gcc/x86_64-conda-linux-gnu/11.4.0/include/c++/memory.h:100
#12 0x00007f2833058eca in Realm::ActiveMessage<Realm::BarrierTriggerMessage, 256ul>::init (this=0x7ee66cc0d510, _target=0, _max_payload_size=26496) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/dynamic_table.inl:53     
#13 0x00007f28330534ca in Realm::ActiveMessage<Realm::BarrierTriggerMessage, 256ul>::ActiveMessage (this=0x7ee66cc0d510, _target=0, _max_payload_size=26496) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/dynamic_table.inl:44
#14 0x00007f28330484b6 in Realm::BarrierTriggerMessage::send_request (target=0, barrier_id=2305860601471041536, trigger_gen=939, previous_gen=111, first_generation=0, redop_id=1048576, migration_target=-1, base_arrival_count=2, data=0x7ee652f764c0, datalen=26496) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/runtime_impl.h:2091         
#15 0x00007f2833049e74 in Realm::BarrierImpl::adjust_arrival (this=0x7ee374203f80, barrier_gen=112, delta=-1, timestamp=0, wait_on=..., sender=0, forwarded=false, reduce_value=0x7ee6b3546bd0, reduce_value_size=32, work_until=...) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/runtime_impl.h:2494                                           
#16 0x00007f2833048205 in Realm::BarrierAdjustMessage::handle_message (sender=0, args=..., data=0x7ee6b3546bd0, datalen=32, work_until=...) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/runtime_impl.h:2055              
#17 0x00007f283304f75c in Realm::HandlerWrappers::wrap_handler<Realm::BarrierAdjustMessage, Realm::BarrierAdjustMessage::handle_message> (sender=0, header=0x7ee4c9696df0, payload=0x7ee6b3546bd0, payload_size=32, work_until=...) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/dynamic_table.inl:620                                           
#18 0x00007f28332a3ae5 in Realm::IncomingMessageManager::do_work (this=0x55998f739e70, work_until=...) at /sdf/group/lcls/ds/tools/conda_envs/cunumeric-nightly/lib/gcc/x86_64-conda-linux-gnu/11.4.0/include/c++/ia32intrin.h:740           
#19 0x00007f283303cc48 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7ee66cc0f0d0, max_time_in_ns=-1, interrupt_flag=0x0) at /sdf/group/lcls/ds/tools/conda_envs/cunumeric-nightly/lib/gcc/x86_64-conda-linux-gnu/11.4.0/include/c++/ia32intrin.h:600
#20 0x00007f283303a8aa in Realm::BackgroundWorkThread::main_loop (this=0x55999c3ee850) at /sdf/group/lcls/ds/tools/conda_envs/cunumeric-nightly/lib/gcc/x86_64-conda-linux-gnu/11.4.0/include/c++/ia32intrin.h:103                           
#21 0x00007f283303de4c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x55999c3ee850) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/mutex.inl:97        
#22 0x00007f28331e3318 in Realm::KernelThread::pthread_entry (data=0x55999c4d0e40) at /sdf/home/s/seshu/src/legate/arch-linux-py-debug/cmake_build/_deps/legion-src/runtime/realm/stl_map.h:854                                              
#23 0x00007f28ea1b91cf in start_thread () from /lib64/libpthread.so.0                                                 
#24 0x00007f28e969bdd3 in clone () from /lib64/libc.so.6
syamajala commented 4 weeks ago

This error only seems to appear when I run with profiling.

eddy16112 commented 4 weeks ago

Which branch are you using? There is no line 2xxx in runtime_impl.h.

syamajala commented 4 weeks ago
commit c032dab254f423ccab36d05c47fed42b94f0b3f5 (HEAD)
Merge: da0427fad 78d10af37
Author: Elliott Slaughter <elliottslaughter@gmail.com>
Date:   Thu Sep 12 23:37:18 2024 +0000

    Merge branch 'ci-fix-gasnet-stable' into 'master'

    ci: Fix CI for GASNet stable branch

    See merge request StanfordLegion/legion!1463
eddy16112 commented 4 weeks ago

I thought you were using the new barrier branch. The error means the reduction value used by the barrier is too big. Currently the active message upper limit for UCX is 8K; I am surprised that Legion uses such a big value with a barrier. You can try increasing the limit to 16K for now: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/ucx/ucp_internal.cc?ref_type=heads#L67
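For illustration only, here is a minimal standalone model (not the actual Realm code) of the check that fires in pbuf_get: the requested payload has to fit within the internal-buffer segment size. With the 26496-byte request from the trace above, both an 8K and a 16K segment would still trip the assertion, which matches the next comment that the limit has to go to 32K.

```cpp
#include <cstddef>
#include <cstdio>

// Standalone model of the check behind "Assertion `size <= ib_seg_size' failed":
// a pre-allocated bounce-buffer segment can only hold payloads up to
// ib_seg_size bytes. Names are simplified; this is not Realm's pbuf_get.
bool payload_fits(std::size_t size, std::size_t ib_seg_size) {
  return size <= ib_seg_size;
}

int main() {
  const std::size_t payload = 26496;  // the size reported in the trace above
  for (std::size_t seg : {8u << 10, 16u << 10, 32u << 10}) {
    std::printf("ib_seg_size=%zu -> %s\n", seg,
                payload_fits(payload, seg) ? "fits" : "assertion would fire");
  }
  return 0;
}
```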

syamajala commented 4 weeks ago

It needs to be 32K. I also have to run with -ucx:pb_max_size 32768.

syamajala commented 4 weeks ago

@eddy16112 do you expect this will go away with the new barrier implementation, or should I leave this issue open for now?

eddy16112 commented 4 weeks ago

In the new barrier branch we need to send the tree to the child nodes, so we have seen cases where header + tree + payload > max_size. Our solution is to divide the tree into fragments, but in your case header + payload is already larger than max_size, so I do not think we have fixed this bug.

Let's keep this bug open for now.

apryakhin commented 3 weeks ago

Yeah, I am surprised we are hitting this on the legacy branch. Likely that's just been there for a while and never tested.

syamajala commented 3 weeks ago

I think outside of legate no one really uses ucx. I tried to use ucx on Summit a year ago, but hit #1396 and #1059. Most of the big machines these days are on slingshot-11, but the s3df cluster at SLAC is 100Gb Ethernet, and even with gasnet we would need to use the ucx conduit in that case.

lightsighter commented 3 weeks ago

Likely that's just been there for a while and never tested.

The new critical path profiling infrastructure will bang on this code path in a way that it rarely got exercised before, which is very likely what is happening here.

eddy16112 commented 3 weeks ago

@lightsighter I am surprised that the critical path profiling uses such large reduction data, almost 36K.

lightsighter commented 3 weeks ago

Legion is not using that large of a reduction. It's reducing this data structure, which is not 36K:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.h?ref_type=heads#L62-74

I suspect there is a performance bug in Realm where it always sends the reduced value for all generations, which keeps growing larger and larger, rather than sending the reduced values only for the difference in subscribed generations.
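For what it's worth, the numbers in the stack trace above seem consistent with that: frame #14 shows trigger_gen=939, previous_gen=111, and datalen=26496, while frame #15 shows reduce_value_size=32, so the payload looks like one 32-byte reduced value per generation in that range. A quick arithmetic check (the interpretation is mine, not confirmed):

```cpp
#include <cstdio>

int main() {
  // Values taken from frames #14 and #15 of the trace above.
  const long trigger_gen = 939, previous_gen = 111;
  const long reduce_value_size = 32;                     // bytes per generation
  const long generations = trigger_gen - previous_gen;   // 828
  const long payload = generations * reduce_value_size;  // 26496
  std::printf("%ld generations * %ld bytes = %ld bytes (datalen was 26496)\n",
              generations, reduce_value_size, payload);
  return 0;
}
```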

apryakhin commented 3 weeks ago

I suspect there is a performance bug in Realm where it always sends the reduced value for all generations which continues to grow larger and larger rather than sending the reduced values for the difference in subscribed generations.

I think it's close but not exactly that: the owner can collapse subscribe generations into a single notify active message, depending on the latest subscribe generation it has observed. If we are running 1000 generations, that can result in a single notification to M subscribers, each of which will carry 1000 * reduce_value_size bytes. That is where I think we are hitting this. The solution, in my opinion, should be much like what we do in the "scalable" barrier branch: a staged broadcast where the number of stages equals payload_size / max_payload_size.
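A minimal sketch of that kind of staged send, under assumed names (send_fragment and the header layout are hypothetical, not the Realm API): the payload is split into chunks no larger than the active-message limit, so the number of stages is ceil(payload_size / max_payload_size).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder for the actual active-message send; a real implementation would
// record (offset, total) in the header so the receiver can reassemble.
void send_fragment(int target, const char *data, std::size_t len,
                   std::size_t offset, std::size_t total) {
  (void)data;
  std::printf("to node %d: %zu bytes at offset %zu of %zu\n", target, len,
              offset, total);
}

// Split an oversized payload into fragments of at most max_payload_size bytes.
void staged_send(int target, const std::vector<char> &payload,
                 std::size_t max_payload_size) {
  const std::size_t total = payload.size();
  const std::size_t stages = (total + max_payload_size - 1) / max_payload_size;
  std::printf("sending %zu bytes in %zu stage(s)\n", total, stages);
  for (std::size_t off = 0; off < total; off += max_payload_size) {
    const std::size_t len = std::min(max_payload_size, total - off);
    send_fragment(target, payload.data() + off, len, off, total);
  }
}

int main() {
  std::vector<char> payload(26496);             // datalen from the trace above
  staged_send(/*target=*/0, payload, 8 << 10);  // assume an 8K per-message cap
  return 0;
}
```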

apryakhin commented 2 weeks ago

@syamajala Is this blocking? I can consider doing a separate fix if we need it immediately; otherwise we can wait until we merge the scalable barrier branch, which should, in my opinion, address it.

syamajala commented 2 weeks ago

I have a workaround for right now, so it's not blocking.