Chris-hughes10 / pytorch-accelerated

A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal, but extensible training loop which is flexible enough to handle the majority of use cases, and capable of utilizing different hardware options with no code changes required. Docs: https://pytorch-accelerated.readthedocs.io/en/latest/
Apache License 2.0

Do we need to set mixed precision explicitly, or is it handled if tensor cores are available? #19

Closed sayakpaul closed 2 years ago

sayakpaul commented 2 years ago

I'm following your awesome guide on timm: https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055.

I am running training on an A100-based VM which should support mixed-precision training. Does Trainer from PyTorch Accelerated take care of that automatically?

Chris-hughes10 commented 2 years ago

Hi, thank you for the kind words! The Trainer will handle mixed precision for you, so there will be no changes needed to your code, as long as it is set in your accelerate config when launching training.

As described here, you can create a config file using the command accelerate config --config_file accelerate_config.yaml. You can verify that mixed precision is set by inspecting the config file which, depending on your answers, should look something like this:

[Screenshot: accelerate_config.yaml with fp16 enabled]
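For reference, a single-machine config with mixed precision enabled looked roughly like the sketch below (illustrative only; the exact fields depend on your accelerate version, and newer releases use a mixed_precision entry instead of fp16):

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
fp16: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1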

If you then launch with that config, using: accelerate launch --config_file accelerate_config.yaml train.py [--training-args]

mixed precision will be enabled!

sayakpaul commented 2 years ago

Thank you!

I am using a main.py script similar to yours (shown in your tutorial), so I assume there won't be a need to launch with a config file, or am I mistaken?

Chris-hughes10 commented 2 years ago

Hi, if you launch without a config file, it will assume the default settings, which use a single GPU with mixed precision disabled.

The idea behind the config file and the accelerate CLI is that you have a consistent launch command regardless of your infrastructure; you don't need to bother with python -m torch.distributed .. and all of that stuff. If you want to use multiple GPUs, or toggle fp16, then all you need to do is update that file, not the code. Of course, launching without the accelerate CLI will still work, but you will need to make some code changes to enable mixed precision.

Personally, I would recommend always launching with accelerate launch and the config file, as it offers the most flexibility with the least amount of effort.

sayakpaul commented 2 years ago

Thanks for clarifying, that helps.

Of course, launching without the accelerate CLI will still work, but you will need to make some code changes to enable mixed precision.

Could you also hint at what those changes might look like? I think it's good to be aware of that as well.

Chris-hughes10 commented 2 years ago

No problem. If you would like to enable mixed precision while launching with python main.py, you can do so in two ways:

  1. The easiest is to set the environment variable USE_FP16 to 'true' (or 'True'; the case doesn't matter). This is what the launcher does for you based on the value in your config file (see the quick example after this list).
  2. Hard code the value in the Trainer so that it always uses mixed precision (this is not really recommended though)
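For example (a sketch of option 1, using the environment variable named above; exact behaviour may vary between accelerate versions), you could launch with:

USE_FP16=true python main.py [--training-args]

and accelerate would pick the flag up just as it does when the launcher exports it for you.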

If you want to go ahead with option 2, you will need to update the Trainer's accelerator object, which is what handles moving data between devices. You can override the _create_accelerator method to do this. Here is an example of how this would look:

from accelerate import Accelerator
from pytorch_accelerated import Trainer

class Fp16Trainer(Trainer):
    """A Trainer whose Accelerator is always created with fp16 enabled."""

    def _create_accelerator(self):
        # Hard-code mixed precision rather than reading it from the launch environment
        return Accelerator(fp16=True)

That seems like a lot of effort for such a small change though, so it wasn't really the intended approach for this!
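If you did go down that route, the subclass is used exactly like the normal Trainer. A minimal sketch, assuming the constructor and train() arguments from the pytorch-accelerated quickstart (the toy model and datasets below are placeholders, purely for illustration):

import torch
from torch import nn
from torch.utils.data import TensorDataset

# Placeholder model and datasets, just to make the sketch self-contained
model = nn.Linear(10, 2)
train_dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
eval_dataset = TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,)))

# Drop-in replacement for Trainer; everything else stays the same
trainer = Fp16Trainer(
    model=model,
    loss_func=nn.CrossEntropyLoss(),
    optimizer=torch.optim.AdamW(model.parameters(), lr=1e-3),
)

trainer.train(
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_epochs=1,
    per_device_batch_size=16,
)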

sayakpaul commented 2 years ago

Thanks so much for being thorough with your explanations. And yes, it makes sense to use config files to launch training whenever possible.