allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.42k stars 643 forks source link

Patches for Lightning have not kept up with backwards-incompatible changes #1249

Closed a-gardner1 closed 1 month ago

a-gardner1 commented 2 months ago

Describe the bug

While there has clearly been some effort to keep pace with changes to Lightning (see #1033), it has fallen behind since the initial patches were created (https://github.com/allegroai/clearml/commit/64e10b2f62f59244750e73604836b57470a2f0d7) and new versions of Lightning were released. Unfortunately, it silently fails to apply patches to model saving and restoration, which can hide the fact that model logging doesn't fully work as expected. One of the two related (and nearly duplicate) patch methods is shown below (linked here)

    @staticmethod
    def _patch_pytorch_lightning_io():
        if PatchPyTorchModelIO.__patched_pytorch_lightning:
            return

        if 'pytorch_lightning' not in sys.modules:
            return

        PatchPyTorchModelIO.__patched_pytorch_lightning = True

        # noinspection PyBroadException
        try:
            import pytorch_lightning  # noqa

            pytorch_lightning.trainer.Trainer.save_checkpoint = _patched_call(
                pytorch_lightning.trainer.Trainer.save_checkpoint, PatchPyTorchModelIO._save)  # noqa

            pytorch_lightning.trainer.Trainer.restore = _patched_call(
                pytorch_lightning.trainer.Trainer.restore, PatchPyTorchModelIO._load_from_obj)  # noqa
        except ImportError:
            pass
        except Exception:
            pass

        # noinspection PyBroadException
        try:
            import pytorch_lightning  # noqa

            # noinspection PyUnresolvedReferences
            pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.save_checkpoint = \
                _patched_call(
                    pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.save_checkpoint,
                    PatchPyTorchModelIO._save)  # noqa

            # noinspection PyUnresolvedReferences
            pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.restore = \
                _patched_call(
                    pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.restore,
                    PatchPyTorchModelIO._load_from_obj)  # noqa
        except ImportError:
            pass
        except Exception:
            pass

Three AttributeErrors exist in _patch_pytorch_lightning_io with newer versions of pytorch-lightning:

To reproduce

No reproduction is necessary. There are multiple clear AttributeErrors that get caught by the Exception handler depending on the pytorch-lightning version.

Expected behaviour

The checkpointing mechanism of pytorch-lightning should have been patched to enable automatic logging of models with ClearML.

Environment

ainoam commented 2 months ago

Thanks for letting us know @a-gardner1. We'll take a look and update on fix availability.

a-gardner1 commented 2 months ago

I can open a PR with a proposed fix if you like. I've already implemented one

ainoam commented 2 months ago

Contributions are most welcome @a-gardner1 🙂

a-gardner1 commented 1 month ago

@ainoam In case you missed it, I did open a PR. Let me know if anything looks off

ainoam commented 1 month ago

Thanks for the friendly nudge @a-gardner1. We'll try to address your PR soon.