MIT-LAE / APCEMM

Aircraft Plume Chemistry, Emissions, and Microphysics Model
MIT License

Inconsistent Results When "OpenMP Num Threads" is Greater Than 1 #19

Open Calebsakhtar opened 2 months ago

Calebsakhtar commented 2 months ago

There seems to be an issue with threads reading and writing parts of the memory at the same time.

Here are the APCEMM outputs from two consecutive runs using 8 threads: Run1, Run2

I was not able to reproduce the random jumps with OpenMP Num Threads set to 1, but I could with more threads.

sdeastham commented 1 month ago

@Calebsakhtar - was this fixed by commit 85a56a3afdf3ac3bd06231d43202a629dc480e8b? More generally, do you still see this bug when running with >1 thread?

Calebsakhtar commented 3 weeks ago

@sdeastham Just to report that compiling APCEMM on commit https://github.com/MIT-LAE/APCEMM/commit/85a56a3afdf3ac3bd06231d43202a629dc480e8b still results in the above bug. I will now attempt compilation on the latest commit https://github.com/MIT-LAE/APCEMM/commit/618f20f2ddbcdeb62cf6fabdea66ddd477a1805b

Calebsakhtar commented 3 weeks ago

Here are the instructions to replicate the behaviour reported above:

  1. Clone the APCEMM git repo
  2. Follow the README installation instructions from the repo
  3. Run example 3

Please note that this behaviour has been observed in both Windows 11 Docker and on the Linux system of the Cambridge HPC.
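One way to check the runs from step 3 for consistency is to compare the same output variable from two consecutive runs. The sketch below assumes the values have already been loaded into plain Python lists; the function names and the sample numbers are hypothetical and are not part of APCEMM.

```python
def max_abs_difference(run_a, run_b):
    """Return the largest absolute element-wise difference between two runs."""
    if len(run_a) != len(run_b):
        raise ValueError("runs have different lengths")
    return max((abs(a - b) for a, b in zip(run_a, run_b)), default=0.0)


def runs_are_identical(run_a, run_b):
    """Return True only if every value matches exactly."""
    return len(run_a) == len(run_b) and all(a == b for a, b in zip(run_a, run_b))


# Made-up values standing in for one output variable from two runs.
run1 = [100.0, 120.0, 150.0, 140.0]
run2 = [100.0, 120.0, 310.0, 140.0]

print(runs_are_identical(run1, run2))
print(max_abs_difference(run1, run2))
```

An exact comparison is used on purpose: for a case with no intended randomness, any difference between runs is worth reporting, and the maximum absolute difference gives a rough sense of its size.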

Calebsakhtar commented 3 weeks ago

@sdeastham Just to report that compiling APCEMM on the latest commit https://github.com/MIT-LAE/APCEMM/commit/618f20f2ddbcdeb62cf6fabdea66ddd477a1805b still results in the above bug.

sdeastham commented 2 weeks ago

Thanks @Calebsakhtar ! To confirm, is that the result when outputting the standard "depth" variable directly or are you calculating a different kind of depth?

Calebsakhtar commented 2 weeks ago

@sdeastham The standard depth variable straight from APCEMM!

sdeastham commented 2 weeks ago

Got it! OK - issue is reproducible on our HPC (in fact, it looks much worse):

image

This seems to have the largest effect on these diagnostic variables. Prognostic variables like ice mass show very small differences, although these should still be nailed down: they shouldn't occur in this case, where in theory there is no randomness because temperature perturbation is disabled for example 3.

image
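As background on how very small differences can appear in a multi-threaded run even without randomness: floating-point addition is not associative, so a sum split across threads can round differently from the same sum done serially. The sketch below is a generic illustration with deliberately extreme made-up values. It is not taken from APCEMM, and nothing in this thread confirms that this is the cause here; `chunked_sum` is a hypothetical stand-in for a parallel reduction.

```python
def serial_sum(values):
    """Add values left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic a parallel reduction: sum each chunk, then combine the partial sums."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [serial_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return serial_sum(partials)


# Extreme values chosen so the rounding difference is easy to see.
values = [1e16, 1.0, -1e16, 1.0]
one_chunk = chunked_sum(values, 1)
two_chunks = chunked_sum(values, 2)

print(one_chunk, two_chunks, one_chunk == two_chunks)
```

Note that a difference of this kind depends only on how the work is split, so it would normally be repeatable for a fixed thread count and schedule. Results that change between consecutive runs with the same settings point more towards the concurrent read/write issue suggested in the original post.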

@michaelxu3 any thoughts you might have on origin would be appreciated! In any case, I'll try to drill down and see if there's an obvious cause of this behaviour.

@Calebsakhtar - can you confirm that this behaviour remains/disappears when:

sdeastham commented 2 weeks ago

@Calebsakhtar Also, was the profile you showed in the original post for Example 3 or for a different case? If it's example 3, that raises the question of why our profiles are so different (even setting aside the noise).

Calebsakhtar commented 2 weeks ago

@sdeastham The profile I showed was for one of the cases with my custom met conditions, not any of the examples. Sorry for not specifying this sooner.

Calebsakhtar commented 2 weeks ago

@sdeastham It will take me a while to confirm the other two cases, but at this time I can confirm that setting export OMP_NUM_THREADS=1 and specifying one core in the input.yaml file does result in the bug disappearing.
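For anyone scripting this check, the environment variable can also be set per run rather than exported in the shell. This is a minimal sketch: `run_with_one_thread` is a hypothetical helper, and the command shown is only a stand-in that echoes the variable, to be replaced with the actual model invocation. The thread count in input.yaml still has to be set to one separately.

```python
import os
import subprocess
import sys


def run_with_one_thread(command):
    """Run a command with OMP_NUM_THREADS=1 set only for that process."""
    env = dict(os.environ, OMP_NUM_THREADS="1")
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


# Stand-in command (assumption): prints the variable as seen by the child process.
result = run_with_one_thread(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
)
print(result.stdout.strip())
```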

Calebsakhtar commented 1 week ago

@sdeastham Finally got around to finishing the HPC runs.

Here are the results:

sdeastham commented 1 week ago

Well, that is odd... thanks @Calebsakhtar ! I'll see if I can figure out what is going on.