ecmwf-ifs / ectrans

Global spherical harmonics transforms library underpinning the IFS
Apache License 2.0

ectrans result is not reproducible on GPU when NPROC changes #144

Open pmarguinaud opened 2 months ago

pmarguinaud commented 2 months ago

Apparently changing NPROC changes numerical results when running on NVIDIA accelerators.

Is this expected? If so, is it being investigated?

I can provide a small test case if necessary.

lukasm91 commented 2 months ago

Hi Philippe

Yes, this is expected. It is rather unlikely that we can have reproducible results with different NPROC due to the batched FFTs and especially batched GEMMs. The GEMMs run on multiple layers at once, so it depends on the exact number of layers per rank.
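The underlying effect is that floating-point addition is not associative, so regrouping the same terms can change the rounded result. The following is a minimal sketch of that effect only, not ectrans or GPU library code: `chunked_sum` and the sample values are hypothetical, and the chunk size merely stands in for whatever partitioning a batched GEMM or FFT happens to pick for a given number of layers per rank.

```python
def chunked_sum(values, chunk):
    """Sum `values` in blocks of `chunk` elements, then sum the block totals.

    Hypothetical helper: it mimics a reduction whose grouping depends on
    how the work is partitioned.
    """
    partials = []
    for start in range(0, len(values), chunk):
        acc = 0.0
        for v in values[start:start + chunk]:
            acc += v  # each addition rounds to the nearest double
        partials.append(acc)

    total = 0.0
    for p in partials:
        total += p
    return total


# Same four numbers, two different groupings.
values = [1e16, 1.0, -1e16, 1.0]

full = chunked_sum(values, 4)    # one block: ((1e16 + 1) - 1e16) + 1
paired = chunked_sum(values, 2)  # two blocks: (1e16 + 1) + (-1e16 + 1)

print(full, paired, full == paired)
```

Both groupings are mathematically equal to 2, yet neither returns 2 and they disagree with each other. In a real transform the differences are typically at the level of the last bits, but that is enough to break bit-for-bit reproducibility.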

What is the use-case here? Is this a production requirement, or a debugging requirement? Depending on this, I would recommend

Any thoughts?

pmarguinaud commented 2 months ago

Hello Lukas,

Thank you for these explanations.

Currently we regularly check the reproducibility of our models (ARPEGE & AROME), and it proves quite useful when we need to debug the model, as we can reduce the number of nodes and still reproduce a problem.

It is also something we require when writing specifications for buying a new machine.

Apparently, everything in ARPEGE but the spectral transforms is reproducible when the number of MPI tasks changes.

But I am not the only one who decides on these matters, so I will discuss this with other Météo-France colleagues.

I would also be curious to hear ECMWF's opinion on this matter.

marsdeno commented 2 months ago

As Lukas mentioned, this has been the case for some time due to the batched maths. My thoughts on this:

Two more points that should help down the line for this