ectrans result is not reproducible on GPU when NPROC changes

pmarguinaud commented 2 months ago

Apparently changing NPROC changes numerical results when running on NVIDIA accelerators.

Is this expected? If so, is it being investigated?

I can provide a small test case if necessary.

lukasm91 commented 2 months ago

Hi Philippe

Yes, this is expected. It is rather unlikely that we can have reproducible results with different NPROC due to the batched FFTs and especially batched GEMMs. The GEMMs run on multiple layers at once, so it depends on the exact number of layers per rank.
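To illustrate why the grouping of work can change the bits of the result, here is a minimal Python sketch. It is not ectrans code: `blocked_sum` and the input values are hypothetical, and the values are deliberately exaggerated. It assumes that a different batch shape leads to a different accumulation order, which is one plausible mechanism rather than a confirmed description of the GPU libraries. Floating-point addition is not associative, so summing the same numbers in different blockings can give different answers.

```python
def blocked_sum(values, block):
    """Sum `values` in chunks of `block` elements, then add the partial sums."""
    partials = []
    for start in range(0, len(values), block):
        acc = 0.0
        for v in values[start:start + block]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# Same data, two different blockings (stand-ins for two batch layouts).
values = [1e16, 1.0, -1e16, 1.0]
one_block = blocked_sum(values, 4)   # ((1e16 + 1) - 1e16) + 1
two_blocks = blocked_sum(values, 2)  # (1e16 + 1) + (-1e16 + 1)

print(one_block, two_blocks)  # prints: 1.0 0.0
```

In a real transform the differences are typically in the last bits rather than this large, but they are enough to break bitwise reproducibility.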

What is the use case here? Is this a production requirement or a debugging requirement? Depending on this, I would recommend:

- running the CPU version, if this is for debugging only and you need reproducible results in a different component
- running with a fixed NPRTRV, which should be reproducible (or, let's say, very likely could be made reproducible). On any run, you could go down to NPRTRV ranks, i.e. if NPRTRV=1, it would in theory be reproducible with 1 rank
- it might be possible to implement a slow version of the GEMMs/FFTs by iterating instead of doing batched GEMMs. IMO it is questionable whether this is useful, because it would be really slow and probably only useful for debugging, in which case one might as well use the CPU version.
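The "iterate instead of batch" idea can be sketched in plain Python. This is an illustration only, not the ectrans implementation: `matmul_fixed_order`, `transform_layers` and the contiguous layer distribution are assumptions made for the example. Each layer is multiplied on its own with a fixed loop order, so the result for a layer does not depend on how many layers a rank holds.

```python
def matmul_fixed_order(a, b):
    """Plain triple-loop matrix product with a fixed accumulation order."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def transform_layers(layers, b, nranks):
    """Split layers into contiguous chunks (one per hypothetical rank),
    process each layer individually, and gather in the original order."""
    per_rank = -(-len(layers) // nranks)  # ceiling division
    gathered = []
    for rank in range(nranks):
        chunk = layers[rank * per_rank:(rank + 1) * per_rank]
        for layer in chunk:  # one layer at a time, no batching
            gathered.append(matmul_fixed_order(layer, b))
    return gathered


layers = [[[0.1 * (l + 1), 0.2], [0.3, 0.7 * (l + 1)]] for l in range(5)]
b = [[0.5, 0.25], [0.125, 0.3]]

reference = transform_layers(layers, b, nranks=1)
same_for_all = all(transform_layers(layers, b, n) == reference for n in (2, 3, 5))
print(same_for_all)  # prints: True
```

The price is the one mentioned above: processing layers one by one gives up the throughput that batching provides.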

Any thoughts?

pmarguinaud commented 2 months ago

Hello Lukas,

Thank you for these explanations.

Currently we regularly check the reproducibility of our models (ARPEGE & AROME), and it proves quite useful when we need to debug the model, as we can reduce the number of nodes and still reproduce a problem.

It is also something we demand when writing specifications for buying a new machine.

Apparently, everything in ARPEGE but the spectral transforms is reproducible when the number of MPI tasks changes.

But I am not the only one who decides on these matters, so I will discuss this with other Météo-France colleagues.

I would also be curious to hear ECMWF's opinion on this matter.

marsdeno commented 2 months ago

As Lukas mentioned, this has been the case for some time due to the batched maths. My thoughts on this:

- although the ability to run with task-count-independent results is an important debugging feature, I believe we do not run operationally with this mode activated
- the task-count independence of results, or at least the ability to run in such a mode, should be maintained going forward in the CPU code path in ectrans
- that said, I think that to debug a large GPU-enabled run we would be OK with a multi-step process: if the bug can't be triggered with CPU ectrans, then it is most likely a bug in ectrans; if it can, we regain task-count independence, which allows debugging on a smaller node count

Two more points that should help with this down the line:

- we are heading towards a unified ectrans library which allows dispatching to GPU or CPU at runtime
- ectrans testing should be improved, for correctness checking of both CPU and GPU code paths

ecmwf-ifs / ectrans

ectrans result is not reproducible on GPU when NPROC changes #144