adamantine-sim / adamantine

Software to simulate heat transfer for additive manufacturing
https://adamantine-sim.github.io/adamantine/

Fix bug when adding material in the distributed case #292

Closed Rombur closed 3 months ago

Rombur commented 4 months ago

@stvdwtt and @AshGannon, can you try this PR on the HourGlass and the gear? The PR fixes the issue on the HourGlass for me.

stvdwtt commented 4 months ago

@Rombur, unfortunately simulations of the gear are still hanging for me. Let me see if I can localize where.

stvdwtt commented 4 months ago

@Rombur, it's working on my CADES VM but not on my laptop -- not sure why. It could be a Docker Desktop issue. Or more specifically I'm using an x86 container on my ARM laptop. Normally that's fine, but maybe that's causing an issue here.

Rombur commented 4 months ago

The new test also hangs in the CI but it works fine on my machine. I'll try to reproduce the problem on a different machine.

Rombur commented 3 months ago

Actually, I am not sure that the code hangs in the CI. I think it's just extremely slow for some reason. @stvdwtt, can you use gdb to find where the code hangs for you?

stvdwtt commented 3 months ago

@Rombur, unfortunately I can't run the debugger on my laptop because of how the VM in Docker Desktop works. I'm not sure if this is because I'm running an x86 image on my ARM laptop or if it would happen with an ARM image as well. (I tried a bit ago to build a new image for ARM, but couldn't get past a Trilinos error.)

On ORC the gear case runs fine in release, debug, and debug in gdb.

At this point, I think the best I can do is put print statements in and try to localize it that way on my laptop.
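A minimal sketch of what such print statements could look like, assuming a small standalone helper (the names `format_checkpoint` and `checkpoint` are hypothetical and not part of adamantine). Each message is tagged with the rank and flushed immediately, since buffered output can be lost when a process hangs. In a real MPI run the rank would come from `MPI_Comm_rank`; here it is passed in as a plain `int` so the sketch stays self-contained.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of adamantine: build a tagged progress line.
// `rank` is assumed to be the MPI rank of the calling process.
std::string format_checkpoint(int rank, std::string const &label)
{
  assert(rank >= 0);
  std::ostringstream os;
  os << "[rank " << rank << "] reached: " << label;
  return os.str();
}

// Write the line to the given stream and flush right away, so the message
// is visible even if the process hangs just after this call.
void checkpoint(std::ostream &out, int rank, std::string const &label)
{
  out << format_checkpoint(rank, label) << std::endl; // std::endl flushes
}
```

Calling `checkpoint(std::cerr, rank, "...")` before and after each suspect step would bracket the hang: the last label printed by each rank shows how far that rank got, and a rank that never prints a label the others reached is a likely candidate for where the run stalls.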

Rombur commented 3 months ago

After increasing the timeout, the new test passes on the CI.

stvdwtt commented 3 months ago

@Rombur, I'm ok to merge this if you are. Maybe you already checked this, but the longer test time seems to only be due to the new hourglass test.

Potential things to address in this PR or to wait for the future:

  1. Modify the hourglass test to run more quickly (I think most of the time is spent loading the mesh, so decreasing the number of time steps won't help, I'd have to be using a coarser mesh)
  2. Sort out the Mac hang bug

I'm ok to wait to do these until the future (especially 2, since we don't know if it is related to the bug this PR fixes or not).

Rombur commented 3 months ago

> Modify the hourglass test to run more quickly (I think most of the time is spent loading the mesh, so decreasing the number of time steps won't help, I'd have to be using a coarser mesh)

The test runs in 100 s on my machine. I've had the same issue with one of Jean-Luc's codes: one test was fast locally but extremely slow in the CI, and we never found out why. I don't think most of the time is spent loading the mesh; at least it doesn't look that way from the output.
I've already decreased the number of time steps quite a bit. I am worried that if we decrease the number of time steps even more, we won't reproduce the error anymore.
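One way to settle whether mesh loading or time stepping dominates the runtime is to accumulate wall-clock time per phase and compare the totals locally and in the CI. The following is a standalone sketch under that assumption; `PhaseTimer` and the phase names are hypothetical, and adamantine may already provide its own timing facilities that would be preferable in practice.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper, not part of adamantine: accumulates wall-clock
// seconds per named phase (e.g. "mesh", "time_steps").
class PhaseTimer
{
public:
  using clock = std::chrono::steady_clock;

  // Begin timing the given phase.
  void start(std::string const &phase)
  {
    _current = phase;
    _begin = clock::now();
  }

  // Stop the running phase and add the elapsed time to its total.
  void stop()
  {
    assert(!_current.empty());
    std::chrono::duration<double> const elapsed = clock::now() - _begin;
    _totals[_current] += elapsed.count();
    _current.clear();
  }

  // Total seconds recorded for a phase; 0 if the phase never ran.
  double seconds(std::string const &phase) const
  {
    auto const it = _totals.find(phase);
    return it == _totals.end() ? 0.0 : it->second;
  }

private:
  std::map<std::string, double> _totals;
  std::string _current;
  clock::time_point _begin;
};
```

Wrapping the mesh setup in `start("mesh")` / `stop()` and the time loop in `start("time_steps")` / `stop()`, then printing both totals at the end of the test, would show which phase accounts for the difference between the local run and the CI run.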

> Sort out the Mac hang bug

Without knowing where the code hangs, I don't think we will be able to fix it.

I think we should merge this PR since it fixes a real bug and it allows @AshGannon to run the HourGlass in parallel. It's annoying that the CI is so slow but unless it becomes an issue for our workflow, I don't really want to spend time on this problem.