ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

Add a little documentation for A100 use #229

Open shadowlocked opened 1 year ago

shadowlocked commented 1 year ago

**Is your feature request related to a problem? Please describe.**
Since Torch 2, the Shivam Colab fails for me roughly three times out of four on a basic T4. The Colab was built as a tightly optimized script with no margin for error, and the arrival of Torch 2 has, in my experience, made a successful training run on a T4 relatively precarious.

For those of us, like me, who are solving the problem by paying for an A100 instance, there is no comprehensive documentation explaining how to set the script up for an A100 instead of a T4.

**Describe the solution you'd like**
I would like the Colab to contain one paragraph of extra advice for A100 use. It should at least mention useful but unreferenced flags such as `--not_cache_latents`, plus any other flags that would help the user get the most out of the A100 but which are not currently present, either as commented-out code or in the help text. A sketch of what such a cell might look like follows.
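For illustration only, here is a rough sketch of what an A100-oriented cell might look like. This is my assumption of the shape, not the Colab's actual code: the flag names are the ones I have seen in Shivam's `train_dreambooth.py`, while the model path, data paths, prompt, batch size, and step count are placeholder values the documentation would need to confirm.

```bash
# Hypothetical A100 training cell -- every value here is illustrative.
# As I understand it, latent caching is a memory-saving measure; with an
# A100's VRAM there is headroom to disable it (--not_cache_latents) and to
# raise --train_batch_size above the T4 settings.
!accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="/content/data/instance" \
  --output_dir="/content/output" \
  --instance_prompt="photo of sks person" \
  --resolution=512 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --not_cache_latents \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --max_train_steps=800
```

Even a single confirmed example like this, maintained in the Colab itself, would save A100 users a lot of trial and error.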

**Describe alternatives you've considered**
I have trawled Reddit, Discord, Google, and even PMs with Shivam in search of a comprehensive guide to rejigging the settings for an A100, but the advice is scrappy, and the best place for a definitive paragraph of advice is the Colab itself.

It should also be mentioned that actions such as not using FP16 (i.e., setting `--mixed_precision="no"`) will likely slow down training considerably, as might removing the 8-bit Adam flag from the training parameters.

The help text could put forward the pros and cons of these changes.
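As a starting point, and strictly as my own reading of the trade-offs rather than measured results, that pros-and-cons text might boil down to something like:

```bash
# Rough trade-offs (my assumptions, not benchmarked figures):
# --mixed_precision="fp16"   faster and lighter on memory; the usual Colab setting
# --mixed_precision="no"     full fp32 throughout; most memory, noticeably slower
# --use_8bit_adam            8-bit optimizer states via bitsandbytes; large memory saving
# (flag omitted)             standard 32-bit AdamW; more memory, and possibly slower
```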