Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

How to save model weights to mlflow tracking server while using MLFLogger to save metrics. #741

Closed yudai09 closed 3 years ago

yudai09 commented 4 years ago

❓ Questions and Help

Does anybody know a simple way to accomplish this?

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I'm looking for a way to save model weights to the mlflow tracking server while using MLFlowLogger to save metrics. My problem is that I cannot find a way to save the model weights to the same run that MLFlowLogger created. When I call mlflow.pytorch.log_model() after trainer.fit(), the metrics and the model weights are saved to different runs.

Code

```
mlf_logger = MLFlowLogger(
    experiment_name="WatchNetExperiment",
)
trainer = pl.Trainer(gpus=hparams.gpus,
                     distributed_backend='dp',
                     min_nb_epochs=hparams.min_epochs,
                     max_nb_epochs=hparams.max_epochs,
                     logger=mlf_logger)
trainer.fit(model)
mlflow.pytorch.log_model(model.model, "my_model")
```

What have you tried?

To work around the problem, I stopped using MLFlowLogger and modified my training code to save metrics in `train_step()` and `validation_end`.

What's your environment?

  - OS: Linux
  - Packaging: conda
  - Version: 0.5.3.2

williamFalcon commented 4 years ago

@neggert, @smurching, @dbczumar?

festeh commented 4 years ago

@yudai09 try mlflow.pytorch.log_model(model.model, "my_model", run_id=mlf_logger.run_id)

yudai09 commented 4 years ago

@festeh thank you. it works.

nsidn98 commented 4 years ago

@yudai09 try mlflow.pytorch.log_model(model.model, "my_model", run_id=mlf_logger.run_id)

This isn't working. Getting the following error:

Traceback (most recent call last):
  File "pythor/RL/Value/value_algos.py", line 276, in <module>
    main(args)
  File "pythor/RL/Value/value_algos.py", line 243, in main
    trainer.fit(model)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in fit
    self.run_pretrain_routine(model)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 376, in train
    self.run_training_teardown()
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 668, in run_training_teardown
    self.on_train_end()
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 53, in on_train_end
    callback.on_train_end(self, self.get_model())
  File "/Users/Downloads/PyThor/pythor/bots/rlCallback.py", line 51, in on_train_end
    mlflow.pytorch.log_model(pl_module.net, "my_model",run_id=pl_module.logger.run_id)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/mlflow/pytorch/__init__.py", line 158, in log_model
    registered_model_name=registered_model_name, **kwargs)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/mlflow/models/__init__.py", line 101, in log
    flavor.save_model(path=local_path, mlflow_model=mlflow_model, **kwargs)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/mlflow/pytorch/__init__.py", line 249, in save_model
    torch.save(pytorch_model, model_path, pickle_module=pickle_module, **kwargs)
TypeError: save() got an unexpected keyword argument 'run_id'
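
The traceback shows why the extra keyword fails: `log_model` forwards unrecognized keyword arguments down the call chain until they reach `torch.save`, which does not accept `run_id`. A minimal pure-Python sketch of that forwarding pattern follows. The functions `save` and `log_model` here are hypothetical stand-ins, not the real torch or mlflow code.

```python
def save(obj, path, pickle_module=None):
    """Stand-in for torch.save: it accepts only a fixed set of keywords."""
    return f"saved {obj!r} to {path}"


def log_model(model, artifact_path, **kwargs):
    """Stand-in for a logging helper that forwards unknown kwargs downstream."""
    return save(model, artifact_path, **kwargs)


error_message = None
try:
    # run_id is not consumed by log_model, so it is passed on to save().
    log_model("net", "my_model", run_id="abc123")
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

The run therefore has to be selected some other way than by passing `run_id` to `log_model`.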

yudai09 commented 4 years ago

@nsidn98 sorry, I was wrong. I replied that it works without confirming. Afterwards I tried it myself, noticed that it does not work, and wrote the following code to work around the problem. https://gitlab.com/chowagiken/mlflow_with_pytorch_lightning/-/blob/master/train.py#L43

    # log model to MLFLow tracking server
    with TemporaryDirectory() as tdname:
        pytorch_model_path = os.path.join(tdname, "my_model")
        mlflow.pytorch.save_model(module.model, pytorch_model_path)
        client = mlflow.tracking.MlflowClient()
        client.log_artifact(mlf_logger._run_id, pytorch_model_path)

The steps are:

  1. Save the model weights to a temporary file path.
  2. Create a new mlflow client.
  3. Log the artifact via the client, using the run_id that was used during training.

I know it's just a workaround and not a straightforward way to log the model weights.
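
The same workaround can be sketched as a small reusable helper. This is only an illustration under assumptions: `log_saved_model` and `save_fn` are hypothetical names, and `client` is assumed to behave like `mlflow.tracking.MlflowClient`, that is, to expose `log_artifacts(run_id, local_dir, artifact_path)`.

```python
import os
import tempfile


def log_saved_model(client, run_id, save_fn, artifact_path="my_model"):
    """Save a model into a temporary directory, then upload it to `run_id`.

    `save_fn(path)` is assumed to write the model files into `path`, for
    example: lambda p: mlflow.pytorch.save_model(module.model, p)
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_dir = os.path.join(tmp_dir, artifact_path)
        save_fn(local_dir)  # step 1: write the model to a temporary path
        # step 3: upload the directory to the run that holds the metrics
        client.log_artifacts(run_id, local_dir, artifact_path=artifact_path)
```

With a real setup, a call might look like `log_saved_model(mlflow.tracking.MlflowClient(), mlf_logger.run_id, lambda p: mlflow.pytorch.save_model(module.model, p))`.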

jedrzejkozal commented 3 years ago

I think you can avoid creating a TemporaryDirectory by calling log_model in a context manager with the proper run id set:

with mlflow.start_run(run_id=mlf_logger.run_id):
    mlflow.pytorch.log_model(model.model, "my_model")

davidefiocco commented 3 years ago

@yudai09 can you reopen this? This doesn't look solved to me; the current solutions seem to be workarounds. It may be trivial, but when using PL, note that the tip by @jedrzejkozal also requires setting the tracking URI:

import mlflow
mlflow.set_tracking_uri('<my-tracking-uri>')
with mlflow.start_run(run_id=mlf_logger.run_id):
    mlflow.pytorch.log_model(model.model, "my_model")

zacharymostowsky commented 3 years ago

Hey all, @jedrzejkozal's solution worked for me; however, it did not automatically log the user and kernel in MLFlow. My solution was to explicitly set the run_id attribute on the MLFlowLogger. It seems like we should be able to pass this into __init__(). If you dig into the MLFlowLogger.experiment() method, you'll find that the run_id will always be None the first time anything is logged.

experiment_name = 'test'
experiment = mlflow.get_experiment_by_name(experiment_name)
exp_id = experiment.experiment_id if experiment else mlflow.create_experiment(experiment_name)

with mlflow.start_run(experiment_id=exp_id) as run:
    run_id = run.info.run_id
    print(f'Run ID: {run_id}')

    mlf_logger = MLFlowLogger(
        experiment_name=experiment_name,
        tracking_uri=tracking_uri
    )

    # ** We need to explicitly set this to log to the same run opened in the notebook **
    mlf_logger._run_id = run_id

    trainer = pl.Trainer(logger=mlf_logger)

    trainer.fit(model, datamodule=data_module)  # Data Module defined outside this scope

I can then use the mlf_logger in all my callbacks to do my logging. The training metrics are also logged using the mlf_logger object.

# Example from a callback -
trainer.logger.log_hyperparams({'test': 1})

zacharymostowsky commented 3 years ago

I will add that I have been unable to use MLFlowLogger.log_model() or MLFlowLogger.experiment.log_model(). Neither method seems to exist. I currently have to use mlflow.pytorch.log_model() which logs to the correct run still but I would like to just use the MLFlowLogger over the mlflow library directly.

pocca2048 commented 3 years ago

@zacharymostowsky self.logger.experiment returns an MlflowClient, whereas log_model lives in mlflow.pytorch (mlflow.pytorch.log_model()). So you can't call log_model there... I think the mlflow documentation is more useful here than the pytorch-lightning documentation.
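
Since `logger.experiment` is a client object, one option is to upload already-saved files through it instead of looking for a `log_model` method. A minimal sketch, assuming `logger` behaves like MLFlowLogger (it exposes `run_id`, and `experiment` is an MlflowClient-like object with `log_artifact(run_id, local_path, artifact_path)`); the helper name is hypothetical:

```python
def log_file_to_logger_run(logger, local_path, artifact_path=None):
    """Upload a local file or directory to the run owned by `logger`.

    Assumes `logger.experiment` is MlflowClient-like and `logger.run_id`
    identifies the run that already receives the training metrics.
    """
    logger.experiment.log_artifact(logger.run_id, local_path, artifact_path)
```

For example, a checkpoint written with `torch.save(model.state_dict(), "model.pt")` could then be attached with `log_file_to_logger_run(trainer.logger, "model.pt", "weights")`. This uploads raw files only; it does not create the MLmodel metadata that `mlflow.pytorch.log_model()` writes.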

piseabhijeet commented 3 years ago

I will add that I have been unable to use MLFlowLogger.log_model() or MLFlowLogger.experiment.log_model(). Neither method seems to exist. I currently have to use mlflow.pytorch.log_model() which logs to the correct run still but I would like to just use the MLFlowLogger over the mlflow library directly.

Hi @zacharymostowsky

I agree with you. Also, there are some PyTorch Lightning models which mlflow cannot log, and it gives me an error (error screenshot not preserved).

I suspect the models should inherit from the nn.Module class and not from PyTorch Lightning. Any idea?

ericjmcd commented 2 years ago

I just came across this issue and this thread helped me find a better(?) solution. Digging through the code, there is a check for the MLFLOW_RUN_ID env var, so I'm going with:

os.environ['MLFLOW_RUN_ID'] = trainer.logger.run_id  # Hack to force MLFlow to 'know' about this run
mlflow.pytorch.log_model(trainer.model.model, "models")

MLFlow's fluent.py has:

client = MlflowClient()
if run_id:
    existing_run_id = run_id
elif _RUN_ID_ENV_VAR in os.environ:   # _RUN_ID_ENV_VAR is defined as "MLFLOW_RUN_ID"
    existing_run_id = os.environ[_RUN_ID_ENV_VAR]
    del os.environ[_RUN_ID_ENV_VAR]
else:
    existing_run_id = None
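
The excerpt implies a precedence: an explicit `run_id` wins, then the `MLFLOW_RUN_ID` environment variable, which is deleted once it has been read. A sketch of that logic as a standalone function, for illustration only (the function name and the `environ` parameter are not part of mlflow):

```python
import os

RUN_ID_ENV_VAR = "MLFLOW_RUN_ID"


def resolve_existing_run_id(run_id=None, environ=None):
    """Mirror the quoted branch: explicit id, then env var, else None."""
    environ = os.environ if environ is None else environ
    if run_id:
        return run_id
    if RUN_ID_ENV_VAR in environ:
        # The variable is removed after it is read, as in the excerpt.
        return environ.pop(RUN_ID_ENV_VAR)
    return None
```

In the version quoted here, this means the variable is consumed on first use, so the hack has to set it again before any later call that should attach to the same run.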

ZhangHangjianMA commented 1 year ago

The solution from @zacharymostowsky worked for me. I can use MLFlowLogger from PL together with mlflow.pytorch.log_model to upload the model to the mlflow server, after setting the tracking URI and experiment ID at the beginning.

    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
    experiment = mlflow.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME)
    exp_id = experiment.experiment_id if experiment else mlflow.create_experiment(MLFLOW_EXPERIMENT_NAME)

    with mlflow.start_run(experiment_id=exp_id) as run:
        run_id = run.info.run_id
        print(f'Run ID: {run_id}')

        mlflow_logger = MLFlowLogger(experiment_name=MLFLOW_EXPERIMENT_NAME, tracking_uri=MLFLOW_TRACKING_URI)
        mlflow_logger._run_id = run_id

        ....

        mlflow.pytorch.log_model(model, "model")

satyajitghana commented 1 year ago

Use this, where logger_ is your MLFlowLogger instance:

    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["state_dict"])
    os.environ['MLFLOW_RUN_ID'] = logger_.run_id
    os.environ['MLFLOW_EXPERIMENT_ID'] = logger_.experiment_id
    os.environ['MLFLOW_EXPERIMENT_NAME'] = logger_._experiment_name
    os.environ['MLFLOW_TRACKING_URI'] = logger_._tracking_uri
    mlflow.pytorch.log_model(model, "model")
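
Setting these variables directly changes the process environment for everything that runs afterwards. One way to limit the effect is to scope the overrides to a `with` block. The helper below is a hypothetical sketch (`temporary_env` is not an mlflow or Lightning API) that restores the previous values on exit:

```python
import contextlib
import os


@contextlib.contextmanager
def temporary_env(**overrides):
    """Set environment variables inside a with-block, then restore them."""
    previous = {name: os.environ.get(name) for name in overrides}
    os.environ.update({name: str(value) for name, value in overrides.items()})
    try:
        yield
    finally:
        for name, old_value in previous.items():
            if old_value is None:
                # The variable did not exist before, so remove it again.
                os.environ.pop(name, None)
            else:
                os.environ[name] = old_value
```

Usage would then look like `with temporary_env(MLFLOW_RUN_ID=logger_.run_id, MLFLOW_TRACKING_URI=logger_._tracking_uri): mlflow.pytorch.log_model(model, "model")`.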