Calebsakhtar opened 4 months ago
@Calebsakhtar - was this fixed by commit 85a56a3afdf3ac3bd06231d43202a629dc480e8b? More generally, do you still see this bug when running with >1 thread?
@sdeastham Just to report that compiling APCEMM on commit https://github.com/MIT-LAE/APCEMM/commit/85a56a3afdf3ac3bd06231d43202a629dc480e8b still results in the above bug. I will now attempt compilation on the latest commit https://github.com/MIT-LAE/APCEMM/commit/618f20f2ddbcdeb62cf6fabdea66ddd477a1805b
Here are the instructions to replicate the behaviour reported above:
Please note that this behaviour has been observed both in Docker on Windows 11 and on the Linux system of the Cambridge HPC.
@sdeastham Just to report that compiling APCEMM on the latest commit https://github.com/MIT-LAE/APCEMM/commit/618f20f2ddbcdeb62cf6fabdea66ddd477a1805b still results in the above bug.
Thanks @Calebsakhtar ! To confirm, is that the result when outputting the standard "depth" variable directly or are you calculating a different kind of depth?
@sdeastham The standard depth variable straight from APCEMM!
Got it! OK - issue is reproducible on our HPC (in fact, it looks much worse):
This seems to have the largest effect on these diagnostic variables. Prognostic variables like ice mass show very small differences (although these should still be nailed down, as they shouldn't happen here, where in theory there is no randomness because the temperature perturbation is disabled for Example 3):
@michaelxu3 any thoughts you might have on origin would be appreciated! In any case, I'll try to drill down and see if there's an obvious cause of this behaviour.
@Calebsakhtar - can you confirm that this behaviour remains/disappears when running with:

export OMP_NUM_THREADS=1

(but input.yaml still lists 8)?

@Calebsakhtar Also, was the profile you showed in the original post for Example 3 or for a different case? If it's Example 3, that raises the question of why our profiles are so different (even setting aside the noise).
@sdeastham The profile I showed was for one of the cases with my custom met conditions, not any of the examples. Sorry for not specifying this sooner.
@sdeastham It will take me a while to confirm the other two cases, but at this time I can confirm that setting export OMP_NUM_THREADS=1 and specifying one core in the input.yaml file does make the bug disappear.
@sdeastham Finally got around to finishing the HPC runs.
Here are the results:
- When OpenMP Num Threads (positive int) is set to 8 in input.yaml, the bug appears regardless of the value of OMP_NUM_THREADS and --cpus-per-task in my SLURM script.
- When OpenMP Num Threads (positive int) is set to 1 in input.yaml, the bug disappears regardless of the value of OMP_NUM_THREADS and --cpus-per-task in my SLURM script.

Well, that is odd... thanks @Calebsakhtar ! I'll see if I can figure out what is going on.
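One mechanism that would explain why the input.yaml value dominates: if the application calls omp_set_num_threads() with the parsed config value, that call overrides the OMP_NUM_THREADS environment variable for subsequent parallel regions, and SLURM's --cpus-per-task then only controls how many cores those threads are packed onto. A minimal sketch of that pattern, assuming a hypothetical Config struct as a stand-in for APCEMM's actual input handling:

```cpp
#include <omp.h>
#include <cstdio>

// Hypothetical stand-in for the value parsed from input.yaml
// ("OpenMP Num Threads (positive int)"); not APCEMM's actual config API.
struct Config { int ompNumThreads; };

int main() {
    Config cfg{8};  // pretend input.yaml requested 8 threads

    // If the application calls omp_set_num_threads() with the config value,
    // it takes precedence over the OMP_NUM_THREADS environment variable
    // for all subsequent parallel regions.
    omp_set_num_threads(cfg.ompNumThreads);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}
```

Under this assumption, reproducibility would track the input.yaml setting exactly as reported above, independent of the environment variable and the SLURM allocation.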
@sdeastham Sadly I have just rerun some cases with OpenMP Num Threads = 8 in the input.yaml file, and it has resulted in the jumps being present in the extinction-defined width (standard APCEMM output).
There seems to be an issue with threads reading and writing the same parts of memory at the same time.
Here are APCEMM outputs with two consecutive runs when using 8 threads:
I was not able to recreate the random jumps with OpenMP Num Threads set to 1, but I was with more threads.
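As context for that hypothesis, below is a minimal sketch (not taken from the APCEMM source) of how an unsynchronized update in an OpenMP loop produces run-to-run differences with more than one thread, together with the conventional reduction-based fix:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> extinction(n, 1e-6);  // stand-in for a per-cell field

    // Racy pattern: every thread updates the shared accumulator without
    // synchronization, so the result differs from run to run with >1 thread.
    double total_racy = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        total_racy += extinction[i];  // data race: unsynchronized read-modify-write
    }

    // Conventional fix: let OpenMP combine per-thread partial sums.
    double total_safe = 0.0;
    #pragma omp parallel for reduction(+ : total_safe)
    for (int i = 0; i < n; ++i) {
        total_safe += extinction[i];
    }

    std::printf("racy: %.12g  safe: %.12g\n", total_racy, total_safe);
    return 0;
}
```

A race of this kind in a shared accumulator feeding a diagnostic such as the extinction-defined width or depth would give exactly this signature: noisy output with 8 threads, reproducible output with 1.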