StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Nondeterministic freezes in CUDA CI #1694

Closed: elliottslaughter closed this issue 4 months ago

elliottslaughter commented 4 months ago

We're seeing nondeterministic freezes in CUDA CI jobs.

The freezing program is hello_world, which is about as basic as it gets. The freezes happen in just about any configuration; for example, no network is required to trigger them.

elliottslaughter commented 4 months ago

Mike pointed out that all the failing jobs are on runner nv-legion-ci-03-2, so this is probably a runner-specific issue that will hopefully go away with a reboot.

lightsighter commented 4 months ago

@elliottslaughter The bad runner has been rebooted. Can you confirm if you're still seeing issues?

elliottslaughter commented 4 months ago

All my reruns look good so far.