MFlowCode / MFC

Exascale simulation of multiphase/physics fluid dynamics
https://mflowcode.github.io
MIT License

Frontier CI sporadically failed 2-rank test in test suite #435

Open sbryngelson opened 1 month ago

sbryngelson commented 1 month ago

The Frontier CI sporadically fails a 2-rank test in the test suite.

One such example: https://github.com/MFlowCode/MFC/actions/runs/9230003014/job/25397349146

One thing to note: on Frontier we automatically get all of a node's GPUs when we request it (exclusive access), though I'm not sure how many we actually use.

I pass this argument in the batch job:

https://github.com/MFlowCode/MFC/blob/4f89f33739da7df6a74151afc7ef89c0f41f2bc9/.github/workflows/frontier/test.sh#L3C1-L3C37

which I guess tries to run 4 tests at once. So there should always be enough GPUs available. Maybe the tests overlap on the same GPUs in the 2-rank case?

Notably, this is quite different from the clever setup that @henryleberre wrote for the Phoenix case: https://github.com/MFlowCode/MFC/blob/4f89f33739da7df6a74151afc7ef89c0f41f2bc9/.github/workflows/phoenix/test.sh#L10-L19
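
To make the overlap hypothesis concrete, here is a minimal Python sketch. It is not the Phoenix script and not MFC's actual device selection. It assumes (hypothetically) that each test picks a GPU per rank as `rank % gpus_per_node`, that 8 GPU devices are visible on the node, and that 4 two-rank tests run at once. The helper names `naive_gpu_ids` and `disjoint_gpu_ids` are made up for illustration.

```python
from typing import List

# Assumptions for illustration only (not taken from the CI scripts):
GPUS_PER_NODE = 8      # GPU devices visible on one node
CONCURRENT_TESTS = 4   # tests running at the same time
RANKS = 2              # MPI ranks per test


def naive_gpu_ids(ranks: int, gpus_per_node: int) -> List[int]:
    """GPU picked by each rank when every test independently uses
    rank % gpus_per_node, unaware of other tests on the node."""
    return [rank % gpus_per_node for rank in range(ranks)]


def disjoint_gpu_ids(slot: int, ranks: int, gpus_per_node: int) -> List[int]:
    """GPUs for the test in concurrency slot `slot`, offset so that
    different slots never share a device."""
    first = slot * ranks
    if first + ranks > gpus_per_node:
        raise ValueError("not enough GPUs for this slot")
    return list(range(first, first + ranks))


# Every concurrent test lands on the same two devices.
naive = [naive_gpu_ids(RANKS, GPUS_PER_NODE) for _ in range(CONCURRENT_TESTS)]

# Each concurrent test gets its own pair of devices.
disjoint = [
    disjoint_gpu_ids(slot, RANKS, GPUS_PER_NODE)
    for slot in range(CONCURRENT_TESTS)
]

print("naive:   ", naive)
print("disjoint:", disjoint)
```

Under these assumptions, all four tests would share GPUs 0 and 1 in the naive case, which could explain sporadic failures if memory or timing on those devices becomes contended. Whether that is what happens on Frontier would need to be checked against the actual launch and device-binding settings.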