Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
27.92k stars 3.34k forks source link

Should Fabric have a `.mark_step()`? #17622

Open awaelchli opened 1 year ago

awaelchli commented 1 year ago

Description & Motivation

In #265, the inference code gets sped up by inserting the XLA mark_step() calls. Should Fabric provide an accelerator/strategy-agnostic method for that?

Pitch

It could be implemented directly like this in the Fabric class:

def mark_step(self):
    if self.device.type == "xla":
        import torch_xla.core.xla_model as xm

        xm.mark_step()
    # for all other accelerators, this is a no-op

The convenience method enables you to call

fabric.mark_step() directly in the loop and no code changes are required once switching to another accelerator or strategy.

Alternatives

The above proposes to add the logic directly to the Fabric object since it is very simple and concerns only the XLA strategy and to keep the base strategy interface lean. If this turns out to generalize and be required for other accelerators in the future, we could then decide to add the interface to the strategy base class.

Additional context

No response

cc @borda @carmocca @justusschock @awaelchli

carmocca commented 1 year ago

AFAIK, only XLA's and Habana's strategies use this.

The only downside I see is that people unfamiliar with this detail of these 2 strategies who are reading a script (lit-llama generate.py for example) could get the idea that calling this is required in general. On the other hand, having it under an if device.type == "xla" explicitly shows that this is just for XLA support.

If done, I would strongly suggest that it is defined in the strategy. Fabric can add a shortcut for it of course.

carmocca commented 1 year ago

cc @lantiga who also suggested this

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!