Lightning-AI / pytorch-lightning


`torch.backends.cudnn.allow_tf32` is no longer on by default #18665

Closed stas00 closed 1 year ago

stas00 commented 1 year ago

Bug description

~This note is incorrect since pt-1.12~ edit - my observation was wrong

https://github.com/Lightning-AI/lightning/blob/363da4aa8539efe8d81ac5b3937b7fe4c8efe4fa/src/lightning/fabric/accelerators/cuda.py#L362-L363

I see the code above is already somewhat aware of it, but I'm not sure what the original intention was - did you mean to set it to True, but weren't doing so before pt-1.12 because it was already the default?

What version are you seeing the problem on?

master

cc @borda @justusschock @awaelchli

awaelchli commented 1 year ago

My understanding is that our comment there cites the docs from PyTorch:

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

It is True by default for me on an A100 (pt 2.0), so the comment seems correct. Is there an edge case we are not aware of?
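
For reference, here's a quick way to check the current defaults locally (the expected values in the comments are my assumptions for pt >= 1.12 on an Ampere GPU, not captured output):

import torch

print(torch.backends.cudnn.allow_tf32)        # expected True: TF32 allowed for cuDNN convolutions
print(torch.backends.cuda.matmul.allow_tf32)  # expected False since pt-1.12: TF32 for matmuls
print(torch.get_float32_matmul_precision())   # expected "highest"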

stas00 commented 1 year ago

That was a blunder on my part - I read cudnn as matmul - guh! my apologies, Adrian!

I tried to find docs on PTL/tf32 but couldn't find any.

Searching the source code I found:

https://github.com/Lightning-AI/lightning/blob/363da4aa8539efe8d81ac5b3937b7fe4c8efe4fa/src/lightning/fabric/accelerators/cuda.py#L356C1-L361C10

The reason PyTorch dropped TF32 from being on by default for matmul in 1.12 was that PyTorch isn't used only for ML training but also for scientific use cases, where high precision is crucial. That's not the case for ML workloads, since training is an adaptive process.

So the bottom line seems to be that ML users of PTL should manually set:

torch.backends.cuda.matmul.allow_tf32 = True
# or
# torch.set_float32_matmul_precision("high")

if they want significantly faster fp32-precision training.

It'll have much less of an impact with mixed precision and mostly no impact with half precision.

Also, judging by how the log message is written, it looks like PyTorch recently added the "medium" precision but hasn't updated the docs to include benchmarks/error figures for it - I'd imagine it'd be even faster and lossier.
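
For context, here is my reading of the three matmul-precision settings (the comments are how I understand the PyTorch docs, not benchmarked numbers); this would typically go near the top of the training script, before the Trainer is created:

import torch

# "highest": pure fp32 matmuls (the default since pt-1.12)
# "high":    allow TF32 for fp32 matmuls (faster, ~10-bit mantissa)
# "medium":  allow bf16-level precision for fp32 matmuls (faster and lossier still)
torch.set_float32_matmul_precision("high")

# older-style flag that corresponds to the "high" setting:
torch.backends.cuda.matmul.allow_tf32 = True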

I don't know if anybody has measured the actual speedup on a typical model, say a GPT. But it's probably worth documenting in the PTL docs: I have now launched NeMo with PTL probably many dozens of times, and only now did I go searching for this log message and find it - I didn't see it before in a flurry of thousands of log lines.
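
If someone wants a rough number, a minimal micro-benchmark sketch along these lines could work (just a large fp32 matmul, not a real model; it assumes a CUDA GPU is available, and the shapes/iteration count are arbitrary):

import time
import torch

def bench(precision, n=8192, iters=50):
    # global setting, checked at matmul dispatch time
    torch.set_float32_matmul_precision(precision)
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return time.perf_counter() - t0

for p in ("highest", "high", "medium"):
    print(p, f"{bench(p):.3f}s")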

awaelchli commented 1 year ago

Yes, I think it's a good idea. It could go to our page for general advice on speeding up training: https://lightning.ai/docs/pytorch/stable/advanced/speed.html#training-on-accelerators

Your suggestion was also planned for the dedicated performance page here: #12398, but nobody has gotten to it so far.

stas00 commented 1 year ago

https://github.com/Lightning-AI/lightning/issues/12398 is 1.5 years old - it doesn't look like it will happen any time soon.

Might it be a good idea to just add quick notes now and improve them down the road?

I'm also asking the PyTorch team to update their overview docs here: https://github.com/pytorch/pytorch/issues/110252, so another strategy would be to just point PTL users to that section from the doc you linked to, Adrian, once it's updated.

But I'm also curious whether TF32 really has much of an impact for non-fp32 precision training - I did benchmark the impact some time back here:

and of course there's a huge impact for pure-fp32 training.