StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: Assertion `finder != device_functions.end()' failed #1682

Closed: syamajala closed this issue 5 months ago

syamajala commented 5 months ago

I'm hitting the following assertion in Realm:

s3d.x: /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb_v2/legion/runtime/realm/cuda/cuda_module.cc:3945: CUfunc_st* Realm::Cuda::GPU::lookup_function(const void*): Assertion `finder != device_functions.end()' failed.

This only seems to happen on Perlmutter. I was able to run on blaze and sapling without any problems. I tried CUDA 11.7, 12.0, and 12.2 on Perlmutter, but they all have the same issue.

Here is a stack trace:

#0  0x00007f50186e5121 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007f50186eae43 in nanosleep () from /lib64/libc.so.6
#2  0x00007f50186ead5a in sleep () from /lib64/libc.so.6
#3  0x00007f500195089a in Realm::realm_freeze (signal=6) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/runtime_impl.cc:206
#4  <signal handler called>
#5  0x00007f5018653d2b in raise () from /lib64/libc.so.6
#6  0x00007f50186553e5 in abort () from /lib64/libc.so.6
#7  0x00007f501864bc6a in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007f501864bcf2 in __assert_fail () from /lib64/libc.so.6
#9  0x00007f50019e5797 in Realm::Cuda::GPU::lookup_function (this=0x9b3c2e0,
    func=0x7f50067abdfa <Realm::Cuda::ReductionKernels::apply_cuda_kernel<Legion::Internal::AddCudaReductions<Legion::SumReduction<double> >, true>(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, Legion::Internal::AddCudaReductions<Legion::SumReduction<double> >)>) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/cuda/cuda_module.cc:3945
#10 0x00007f5001a1eae5 in Realm::Cuda::GPUreduceXferDes::GPUreduceXferDes (this=0x7f32ec0ec170, _dma_op=139858631131232, _channel=0xc6f2d70, _launch_node=0, _guid=945, inputs_info=..., outputs_info=..., _priority=0, _redop_info=...)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/cuda/cuda_internal.cc:2132
#11 0x00007f5001a20620 in Realm::Cuda::GPUreduceChannel::create_xfer_des (this=0xc6f2d70, dma_op=139858631131232, launch_node=0, guid=945, inputs_info=..., outputs_info=..., priority=0, redop_info=..., fill_data=0x0, fill_size=0,
    fill_total=0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/cuda/cuda_internal.cc:2533
#12 0x00007f50013416eb in Realm::SimpleXferDesFactory::create_xfer_des (this=0xc6f2d98, dma_op=139858631131232, launch_node=0, target_node=0, guid=945, inputs_info=..., outputs_info=..., priority=0, redop_info=..., fill_data=0x0,
    fill_size=0, fill_total=0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/channel.cc:4686
#13 0x00007f5001380725 in Realm::TransferOperation::create_xds (this=0x7f3360070060) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5594
#14 0x00007f500137dfd6 in Realm::TransferOperation::allocate_ibs (this=0x7f3360070060) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5217
#15 0x00007f500137d13c in Realm::TransferOperation::start_or_defer (this=0x7f3360070060) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5059
#16 0x00007f50013a1e2c in Realm::IndexSpace<2, long long>::copy (this=0x7f35aafc7360, srcs=..., dsts=..., indirects=..., requests=..., wait_on=..., priority=0)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/transfer/transfer.cc:5723
#17 0x00007f5006ac7a19 in Realm::IndexSpace<2, long long>::copy (this=0x7f35aafc7360, srcs=..., dsts=..., requests=..., wait_on=..., priority=0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/indexspace.inl:903
#18 0x00007f5006ab046e in Legion::Internal::IndexSpaceExpression::issue_copy_internal<2, long long> (this=0x7f36800e4bb0, forest=0xc6fd1e0, op=0x7f365c01f520, space=..., trace_info=..., dst_fields=..., src_fields=..., reservations=...,
    precondition=..., pred_guard=..., src_unique=..., dst_unique=..., collective=Legion::Internal::COLLECTIVE_NONE, priority=0, replay=false) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/region_tree.inl:239
#19 0x00007f5006a8dff4 in Legion::Internal::IndexSpaceNodeT<2, long long>::issue_copy (this=0x7f36800e4880, op=0x7f365c01f520, trace_info=..., dst_fields=..., src_fields=..., reservations=..., precondition=..., pred_guard=...,
    src_unique=..., dst_unique=..., collective=Legion::Internal::COLLECTIVE_NONE, priority=0, replay=false) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/region_tree.inl:4988
#20 0x00007f500648a5e7 in Legion::Internal::IndividualView::copy_from (this=0x7f3360191320, src_view=0x7f32ec1ab560, precondition=..., predicate_guard=..., reduction_op_id=1048587, copy_expression=0x7f36800e4bb0, op=0x7f365c01f520,
    index=0, collective_match_space=13, copy_mask=..., src_point=0x7f32ec1a83c0, trace_info=..., recorded_events=..., applied_events=..., across_helper=0x0, manage_dst_events=true, copy_restricted=false, need_valid_return=false)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_views.cc:2471
#21 0x00007f5005e6d4ba in Legion::Internal::CopyFillAggregator::issue_copies (this=0x7f33601975a0, target=0x7f3360191320, copies=..., recorded_events=..., precondition=..., copy_mask=..., trace_info=..., manage_dst_events=true,
    restricted_output=false, dst_events=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:7328
#22 0x00007f5005e6ba20 in Legion::Internal::CopyFillAggregator::perform_updates (this=0x7f33601975a0, updates=..., trace_info=..., precondition=..., recorded_events=..., redop_index=0, manage_dst_events=true, restricted_output=false,
    dst_events=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:7053
#23 0x00007f5005e6b2b6 in Legion::Internal::CopyFillAggregator::issue_updates (this=0x7f33601975a0, trace_info=..., precondition=..., restricted_output=false, manage_dst_events=true, dst_events=0x0, stage=0)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:6911
#24 0x00007f5005e77cfc in Legion::Internal::UpdateAnalysis::perform_updates (this=0x7f3360190d40, perform_precondition=..., applied_events=..., already_deferred=false)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_analysis.cc:9294
#25 0x00007f5006506ef7 in Legion::Internal::RegionTreeForest::physical_perform_updates (this=0xc6fd1e0, req=..., version_info=..., op=0x7f365c01f520, index=0, precondition=..., term_event=..., targets=..., sources=..., trace_info=...,
    map_applied_events=..., analysis=@0x7f336011e620: 0x7f3360190d40, log_name=0x457c7a0 "copy_global", uid=6150, collective_rendezvous=false, record_valid=true, check_initialized=true, defer_copies=true)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/region_tree.cc:1923
#26 0x00007f500639e22f in Legion::Internal::SingleTask::map_all_regions (this=0x7f365c01f340, must_epoch_op=0x0, defer_args=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:4253
#27 0x00007f50063ab02e in Legion::Internal::PointTask::perform_mapping (this=0x7f365c01f340, must_epoch_owner=0x0, args=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:7289
#28 0x00007f50063bc991 in Legion::Internal::SliceTask::perform_mapping (this=0x7f333c0c6b40, epoch_owner=0x0, args=0x0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:11453
#29 0x00007f50063a4ea7 in Legion::Internal::MultiTask::trigger_mapping (this=0x7f333c0c6b40) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/legion_tasks.cc:5607
#30 0x00007f50066263bd in Legion::Internal::Runtime::legion_runtime_task (args=0x7f33600a2560, arglen=12, userdata=0xc726700, userlen=8, p=...) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/legion/runtime.cc:32276
#31 0x00007f500192987b in Realm::LocalTaskProcessor::execute_task (this=0xc67d260, func_id=4, task_args=...) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/proc_impl.cc:1176
#32 0x00007f500199f21c in Realm::Task::execute_on_processor (this=0x7f33600a23e0, p=...) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:326
#33 0x00007f50019a31e4 in Realm::KernelThreadTaskScheduler::execute_task (this=0xc67d600, task=0x7f33600a23e0) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1421
#34 0x00007f50019a202c in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xc67d600) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1160
#35 0x00007f50019a265a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0xc67d600) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/tasks.cc:1272
#36 0x00007f50019a9508 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0xc67d600)
    at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/threads.inl:97
#37 0x00007f50019b58bf in Realm::KernelThread::pthread_entry (data=0x7f35a0130e20) at /pscratch/sd/s/seshu/gb2024/legion_s3d_tdb/legion/runtime/realm/threads.cc:831
#38 0x00007f501c1b76ea in start_thread () from /lib64/libpthread.so.0
#39 0x00007f501872149f in clone () from /lib64/libc.so.6
lightsighter commented 5 months ago

@muraj Can you take a look at this?

syamajala commented 5 months ago

Could really use some help with this, as I'm trying to get some Gordon Bell runs done on Perlmutter.

eddy16112 commented 5 months ago

After talking with @syamajala, we figured out the issue: we built Realm with CUDART_HIJACK=ON, but the Cray wrappers link cudart automatically, so the hijack is not active at runtime and none of these kernels get registered with Realm.
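
To make the failure mode concrete: this is an illustrative sketch (not Realm's actual code) of the hijack pattern. The names `hijacked_register_function` and `CUfunction_stub` are hypothetical; the point is that `lookup_function` can only succeed if the interposed registration path actually ran, which it does not when the real cudart is linked in.

```cpp
#include <cassert>
#include <map>

using CUfunction_stub = void*;  // stands in for CUfunc_st*

// Map from a kernel's host-side handle to its device function,
// populated by the interposed (hijacked) CUDA runtime entry point.
static std::map<const void*, CUfunction_stub> device_functions;

// Called at startup for each __global__ kernel -- but only if the
// hijack library, not the real libcudart, is what the binary links.
void hijacked_register_function(const void* host_fn, CUfunction_stub dev_fn) {
  device_functions[host_fn] = dev_fn;
}

CUfunction_stub lookup_function(const void* host_fn) {
  auto finder = device_functions.find(host_fn);
  // This is the assertion that fires when registration was bypassed.
  assert(finder != device_functions.end());
  return finder->second;
}
```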

syamajala commented 5 months ago

Yeah, the Cray wrappers are broken and have no way to turn off linking against cudart. I opened a NERSC ticket about this a year ago. Their solution was for me to link manually and just remove -lcudart from the flags. They closed the ticket.

I am able to do my runs now.
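
The manual-link workaround described above can be sketched roughly as follows. On Cray PE the wrapper can print the flags it would inject (e.g. `CC --cray-print-opts=libs`); the idea is to strip `-lcudart` from that list and run the final link by hand so the real CUDA runtime never lands on the link line. The sample flag string below is a stand-in for the wrapper's output, since the exact libraries vary by module environment.

```shell
# Stand-in for `CC --cray-print-opts=libs` output on a real system.
cray_libs='-lcupti -lcudart -lcuda -lm'

# Strip -lcudart so Realm's hijacked entry points are used instead
# of the real CUDA runtime.
link_flags=$(printf '%s' "$cray_libs" | sed -E 's/(^| )-lcudart( |$)/ /g')
echo "link with: $link_flags"
```

Afterward, `ldd` on the resulting binary should show no `libcudart` dependency; if it still appears, the hijack's registration path is being bypassed.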