Rombur closed 3 months ago
@Rombur, unfortunately simulations of the gear are still hanging for me. Let me see if I can localize where.
@Rombur, it's working on my CADES VM but not on my laptop, and I'm not sure why. It could be a Docker Desktop issue or, more specifically, a result of running an x86 container on my ARM laptop. Normally that's fine, but maybe it's causing an issue here.
The new test also hangs in the CI but it works fine on my machine. I'll try to reproduce the problem on a different machine.
Actually, I am not sure that the code hangs in the CI. I think it's just extremely slow for some reason. @stvdwtt, can you use gdb to find where the code hangs for you?
@Rombur, unfortunately I can't run the debugger on my laptop because of how the VM in Docker Desktop works. I'm not sure if this is because I'm running an x86 image on my ARM laptop or if it would happen with an ARM image as well. (I tried a while ago to build a new image for ARM, but couldn't get past a Trilinos error.)
On ORC the gear case runs fine in release, debug, and debug in gdb.
At this point, I think the best I can do is put print statements in and try to localize it that way on my laptop.
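A minimal sketch of what that print-statement localization could look like, assuming plain C++ with no dependency on the project's own logging. The names `checkpoint_message`, `checkpoint`, and `CHECKPOINT` are hypothetical and not part of the code base. The message is written to `std::cerr` and flushed immediately, so the last checkpoint printed before a hang is not lost in an output buffer.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the code base discussed here).
// Builds a message of the form "[checkpoint] <label> (<file>:<line>)".
std::string checkpoint_message(std::string const &label, char const *file,
                               int line)
{
  std::ostringstream oss;
  oss << "[checkpoint] " << label << " (" << file << ":" << line << ")";
  return oss.str();
}

// Writes the message to stderr. std::endl flushes the stream, so the
// message is visible even if the program hangs right afterwards.
void checkpoint(std::string const &label, char const *file, int line)
{
  std::cerr << checkpoint_message(label, file, line) << std::endl;
}

// Convenience macro that records the call site automatically.
#define CHECKPOINT(label) checkpoint(label, __FILE__, __LINE__)
```

Placing calls such as `CHECKPOINT("before mesh load");` and `CHECKPOINT("after mesh load");` around suspect regions, then halving the interval between the last printed checkpoint and the first missing one, should narrow down where the run stalls.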
After increasing the timeout, the new test passes on the CI.
@Rombur, I'm ok to merge this if you are. Maybe you already checked this, but the longer test time seems to only be due to the new hourglass test.
Potential things to address in this PR or to wait for the future:
I'm ok to wait to do these until the future (especially 2, since we don't know if it is related to the bug this PR fixes or not).
1. Modify the hourglass test to run more quickly (I think most of the time is spent loading the mesh, so decreasing the number of time steps won't help; I'd have to use a coarser mesh)
The test runs in 100 s on my machine. I've had the same issue with one of Jean-Luc's codes: one test was fast locally but extremely slow in the CI. We never found out why.
I don't think most of the time is spent loading the mesh; at least it doesn't look that way from the output.
I've already decreased the number of time steps quite a bit. I am worried that if we decrease the number of time steps even more, we won't reproduce the error anymore.
2. Sort out the Mac hang bug
Without knowing where the code hangs, I don't think we will be able to fix it.
I think we should merge this PR since it fixes a real bug and it allows @AshGannon to run the HourGlass in parallel. It's annoying that the CI is so slow, but unless it becomes an issue for our workflow, I don't really want to spend time on this problem.
@stvdwtt and @AshGannon, can you try this PR on the HourGlass and the gear? The PR fixes the issue on the HourGlass for me.