StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Invalid index space sparsity handle #1768

Open elliottslaughter opened 23 hours ago

elliottslaughter commented 23 hours ago

I'm seeing this failure mode in Frontier CI on latest master:

srun -n4 --cpus-per-task 14 --gpus-per-task 2 --cpu-bind cores --network=single_node_vni /lustre/orion/ums036/proj-shared/ci/32658_72820/tmpvz8pxq5r/build/bin/nested_replication -logfile out_%.log -ll:gpu 1 -ll:fsize 1024 -ll:msize 64 -ll:cpu 4
[0 - 7fffe586e380]    0.085869 {6}{realm}: invalid index space sparsity handle: id=28
nested_replication: /lustre/orion/ums036/proj-shared/ci/32658_72820/runtime/realm/runtime_impl.cc:2995: Realm::SparsityMapImplWrapper* Realm::RuntimeImpl::get_sparsity_impl(Realm::ID): Assertion `0 && "invalid index space sparsity handle"' failed.

https://code.olcf.ornl.gov/ci/ums036/dev/legion/-/jobs/72820

Although this is technically a HIP run, I don't think it's a HIP issue? I suspect we just have a regression in the sparsity map code on multiple nodes, and maybe our fake multi-node CI jobs are just insufficient to catch it.

lightsighter commented 22 hours ago

I think we'll probably need a backtrace to start to know where to go looking.

elliottslaughter commented 18 hours ago

This is nondeterministic, even in CI. I'm still trying to find a way to reproduce it.
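
One way to chase an intermittent failure like this is to rerun the failing command in a loop until it trips the assertion, with backtraces enabled. The following is only a minimal sketch: `run_until_failure` is a hypothetical helper, not part of Legion or the CI scripts, and the use of `REALM_BACKTRACE=1` is an assumption about how the backtrace would be requested (older builds may use `LEGION_BACKTRACE=1` instead).

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a command until it fails, up to a maximum
# number of attempts. Returns 1 as soon as one attempt fails, and 0 if
# every attempt succeeds.
run_until_failure() {
  local max_attempts=$1
  shift
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if ! "$@"; then
      echo "command failed on attempt ${attempt}" >&2
      return 1
    fi
  done
  echo "no failure in ${max_attempts} attempts" >&2
  return 0
}

# Example usage (assumption: REALM_BACKTRACE=1 requests a backtrace on
# the failing assertion; substitute the real srun command from the CI job):
#   export REALM_BACKTRACE=1
#   run_until_failure 50 srun -n4 ... nested_replication -logfile out_%.log ...
```

Stopping on the first failure leaves the `out_%.log` files from the failing run in place, so they are not overwritten by later attempts.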