Open sbryngelson opened 1 month ago
Frontier CI sporadically fails a 2-rank test in the test suite.
One such example: https://github.com/MFlowCode/MFC/actions/runs/9230003014/job/25397349146
One note is that we automatically get all GPUs when we request a node (exclusive access) on Frontier, though I'm not sure how many we actually use.
I pass this argument in the batch job:
https://github.com/MFlowCode/MFC/blob/4f89f33739da7df6a74151afc7ef89c0f41f2bc9/.github/workflows/frontier/test.sh#L3C1-L3C37
which I guess tries to run 4 tests at once, so there should always be enough GPUs available. Maybe the tests end up overlapping on the same GPUs in the 2-rank case?
Notably, this is quite different from the clever setup that @henryleberre wrote for the Phoenix case: https://github.com/MFlowCode/MFC/blob/4f89f33739da7df6a74151afc7ef89c0f41f2bc9/.github/workflows/phoenix/test.sh#L10-L19
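To make the overlap question concrete, here is a minimal sketch of what non-overlapping GPU assignment for concurrently running tests could look like. It is not what either CI script currently does. It assumes 8 GPUs are visible on the node and that each test process could be pinned through an environment variable such as `ROCR_VISIBLE_DEVICES`; `assign_gpus` and `visible_devices` are hypothetical helpers, not part of MFC.

```python
from typing import List


def assign_gpus(num_gpus: int, ranks_per_test: List[int]) -> List[List[int]]:
    """Give each concurrently running test its own disjoint block of GPU ids.

    Hypothetical helper: one GPU per MPI rank, handed out in order.
    """
    assignments = []
    next_free = 0
    for ranks in ranks_per_test:
        if next_free + ranks > num_gpus:
            raise ValueError("not enough GPUs for all concurrent tests")
        assignments.append(list(range(next_free, next_free + ranks)))
        next_free += ranks
    return assignments


def visible_devices(gpu_ids: List[int]) -> str:
    """Format GPU ids as a comma-separated device list for an env variable."""
    return ",".join(str(g) for g in gpu_ids)


# Example: 4 tests at once, two of them with 2 ranks, on an assumed 8-GPU node.
for test_id, gpus in enumerate(assign_gpus(8, [2, 2, 1, 1])):
    print(f"test {test_id}: ROCR_VISIBLE_DEVICES={visible_devices(gpus)}")
```

With 4 tests at once and at most 2 ranks each, this never needs more than 8 GPUs, so if the failures come from overlap, the cause would be the rank-to-GPU mapping rather than a shortage of devices.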