Closed yudai09 closed 3 years ago
@neggert, @smurching, @dbczumar?
@yudai09 try mlflow.pytorch.log_model(model.model, "my_model", run_id=mfl_logger.run_id)
@festeh thank you. it works.
> @yudai09 try mlflow.pytorch.log_model(model.model, "my_model", run_id=mfl_logger.run_id)
This isn't working. Getting the following error:
Traceback (most recent call last):
  File "pythor/RL/Value/value_algos.py", line 276, in <module>
    main(args)
  File "pythor/RL/Value/value_algos.py", line 243, in main
    trainer.fit(model)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in fit
    self.run_pretrain_routine(model)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 376, in train
    self.run_training_teardown()
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 668, in run_training_teardown
    self.on_train_end()
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 53, in on_train_end
    callback.on_train_end(self, self.get_model())
  File "/Users/Downloads/PyThor/pythor/bots/rlCallback.py", line 51, in on_train_end
    mlflow.pytorch.log_model(pl_module.net, "my_model",run_id=pl_module.logger.run_id)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/mlflow/pytorch/__init__.py", line 158, in log_model
    registered_model_name=registered_model_name, **kwargs)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/mlflow/models/__init__.py", line 101, in log
    flavor.save_model(path=local_path, mlflow_model=mlflow_model, **kwargs)
  File "/Users/opt/miniconda3/envs/py36/lib/python3.6/site-packages/mlflow/pytorch/__init__.py", line 249, in save_model
    torch.save(pytorch_model, model_path, pickle_module=pickle_module, **kwargs)
TypeError: save() got an unexpected keyword argument 'run_id'
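The traceback suggests why passing `run_id=` fails: unrecognized keyword arguments appear to travel from `log_model` through `save_model` into `torch.save`, which has no `run_id` parameter. Below is a minimal pure-Python sketch of that forwarding pattern. The `save` and `log_model` functions here are stand-ins written for illustration, not the real PyTorch or MLflow code.

```python
def save(obj, path, pickle_module=None):
    """Stand-in for torch.save: it has no run_id parameter."""
    return path


def log_model(model, artifact_path, **kwargs):
    """Stand-in for mlflow.pytorch.log_model: extra kwargs are forwarded as-is."""
    return save(model, artifact_path, **kwargs)


def try_log_with_run_id():
    """Return the error message raised when run_id is forwarded to save()."""
    try:
        log_model(object(), "my_model", run_id="abc123")
    except TypeError as err:
        return str(err)
    return None
```

Calling `try_log_with_run_id()` returns a message of the same shape as the last line of the traceback, because the unknown keyword only gets rejected at the innermost call.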
@nsidn98 sorry, I was wrong. I replied that it works without confirming it.
Afterwards I tried it myself, found that it does not work, and wrote the following code to work around the problem:
https://gitlab.com/chowagiken/mlflow_with_pytorch_lightning/-/blob/master/train.py#L43
import os
from tempfile import TemporaryDirectory
import mlflow.pytorch

# log model to MLflow tracking server
with TemporaryDirectory() as tdname:
    pytorch_model_path = os.path.join(tdname, "my_model")
    mlflow.pytorch.save_model(module.model, pytorch_model_path)
    client = mlflow.tracking.MlflowClient()
    # upload while the temporary directory still exists
    client.log_artifact(mlf_logger._run_id, pytorch_model_path)
I know it's just a workaround and not a straightforward way to log the model weights.
I think you can avoid creating a TemporaryDirectory by calling log_model inside a context manager with the proper run ID set:
with mlflow.start_run(run_id=mlf_logger.run_id):
    mlflow.pytorch.log_model(model.model, "my_model")
@yudai09 can you reopen this? This doesn't look solved to me; the current solutions look like workarounds. It may be trivial, but when using PL, note that the tip by @jedrzejkozal also requires setting the tracking URI:
import mlflow
mlflow.set_tracking_uri('<my-tracking-uri>')
with mlflow.start_run(run_id=mlf_logger.run_id):
    mlflow.pytorch.log_model(model.model, "my_model")
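For reuse, the tracking-URI step and the `start_run` context manager can be wrapped in one function. This is only a sketch: `log_model_to_logger_run` is a hypothetical name, the logger is assumed to expose the `run_id` property used in this thread, and `mlflow` is imported inside the function so that the helper can be defined even where MLflow is not installed.

```python
def log_model_to_logger_run(mlf_logger, torch_model, tracking_uri,
                            artifact_path="my_model"):
    """Log torch_model into the run that mlf_logger already writes metrics to.

    Hypothetical helper: mlf_logger is assumed to expose the run_id
    property used earlier in this thread.
    """
    # Imported here so the helper can be defined without mlflow installed.
    import mlflow
    import mlflow.pytorch

    # Point the fluent API at the same server the logger uses.
    mlflow.set_tracking_uri(tracking_uri)
    # Re-open the logger's existing run instead of creating a new one.
    with mlflow.start_run(run_id=mlf_logger.run_id):
        mlflow.pytorch.log_model(torch_model, artifact_path)
```

After `trainer.fit(model)` this could be called as `log_model_to_logger_run(mlf_logger, model.model, '<my-tracking-uri>')`.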
Hey all, @jedrzejkozal's solution worked for me; however, it did not do the automatic logging of the user and kernel in MLflow. My solution was to explicitly set the `run_id` attribute on the `MLFlowLogger`. It seems like we should be able to pass this into `__init__()`. If you dig into the `MLFlowLogger.experiment()` method, you'll find that the `run_id` will always be None the first time anything is logged.
experiment_name = 'test'
experiment = mlflow.get_experiment_by_name(experiment_name)
exp_id = experiment.experiment_id if experiment else mlflow.create_experiment(experiment_name)
with mlflow.start_run(experiment_id=exp_id) as run:
    run_id = run.info.run_id
    print(f'Run ID: {run_id}')
    mlf_logger = MLFlowLogger(
        experiment_name=experiment_name,
        tracking_uri=tracking_uri
    )
    # ** We need to explicitly set this to log to the same run opened in the notebook **
    mlf_logger._run_id = run_id
    trainer = pl.Trainer(logger=mlf_logger)
    trainer.fit(model, datamodule=data_module)  # Data Module defined outside this scope
I can then use the `mlf_logger` in all my callbacks to do my logging. The training metrics are also logged using the `mlf_logger` object.
# Example from a callback -
trainer.logger.log_hyperparams({'test': 1})
I will add that I have been unable to use `MLFlowLogger.log_model()` or `MLFlowLogger.experiment.log_model()`. Neither method seems to exist. I currently have to use `mlflow.pytorch.log_model()`, which still logs to the correct run, but I would like to just use the `MLFlowLogger` over the `mlflow` library directly.
@zacharymostowsky `self.logger.experiment` returns an `MlflowClient`, while `log_model` lives in `mlflow.pytorch.log_model()`. So you can't use `log_model` there...
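Since the client has no `log_model`, one option is to save the model locally and upload the resulting directory through the client, in the spirit of the temporary-directory workaround earlier in this thread. A rough sketch, assuming the client offers `log_artifacts(run_id, local_dir, artifact_path)` as `MlflowClient` does; `log_model_via_client` and its `save_fn` parameter are hypothetical names introduced here.

```python
import os
from tempfile import TemporaryDirectory


def log_model_via_client(client, run_id, model, save_fn, artifact_path="my_model"):
    """Save model to a temp dir with save_fn, then upload it with the client.

    Hypothetical helper. client is expected to offer
    log_artifacts(run_id, local_dir, artifact_path), as MlflowClient does;
    save_fn(model, path) is expected to behave like mlflow.pytorch.save_model.
    """
    with TemporaryDirectory() as tmp_dir:
        model_dir = os.path.join(tmp_dir, artifact_path)
        save_fn(model, model_dir)
        # Upload before the temporary directory is cleaned up.
        client.log_artifacts(run_id, model_dir, artifact_path)
```

Inside a Lightning callback this might be called as `log_model_via_client(trainer.logger.experiment, trainer.logger.run_id, pl_module.model, mlflow.pytorch.save_model)`, assuming the wrapped network lives in `pl_module.model` as in the original question.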
I think mlflow documentation is more useful than pytorch-lightning documentation.
> I will add that I have been unable to use `MLFlowLogger.log_model()` or `MLFlowLogger.experiment.log_model()`. Neither method seems to exist. I currently have to use `mlflow.pytorch.log_model()`, which still logs to the correct run, but I would like to just use the `MLFlowLogger` over the `mlflow` library directly.
Hi @zacharymostowsky,
I agree with you. Also, there are some PyTorch Lightning models that MLflow cannot log, which gives me this error:
I suspect models should inherit from the `nn.Module` class and not from PyTorch Lightning. Any idea?
I just came across this issue and this thread helped me to a (better?) solution. Digging through the code, I found a check for the MLFLOW_RUN_ID environment variable, so I'm going with:
os.environ['MLFLOW_RUN_ID'] = trainer.logger.run_id # Hack to force MLFlow to 'know' about this run
mlflow.pytorch.log_model(trainer.model.model, "models")
MLFlow's fluent.py has:
client = MlflowClient()
if run_id:
    existing_run_id = run_id
elif _RUN_ID_ENV_VAR in os.environ:  # _RUN_ID_ENV_VAR is defined as "MLFLOW_RUN_ID"
    existing_run_id = os.environ[_RUN_ID_ENV_VAR]
    del os.environ[_RUN_ID_ENV_VAR]
else:
    existing_run_id = None
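Because the excerpt deletes `MLFLOW_RUN_ID` after reading it, the variable acts as a one-shot hint. A small context manager can make the hack less leaky by restoring whatever value was there before. This is a sketch using only the standard library; `mlflow_run_id_env` is a hypothetical name.

```python
import os
from contextlib import contextmanager

RUN_ID_ENV_VAR = "MLFLOW_RUN_ID"


@contextmanager
def mlflow_run_id_env(run_id):
    """Temporarily expose run_id via MLFLOW_RUN_ID, then restore the old state."""
    previous = os.environ.get(RUN_ID_ENV_VAR)
    os.environ[RUN_ID_ENV_VAR] = run_id
    try:
        yield
    finally:
        # MLflow may already have deleted the variable after reading it,
        # so remove it defensively before restoring any earlier value.
        os.environ.pop(RUN_ID_ENV_VAR, None)
        if previous is not None:
            os.environ[RUN_ID_ENV_VAR] = previous
```

Usage would then look like `with mlflow_run_id_env(trainer.logger.run_id): mlflow.pytorch.log_model(trainer.model.model, "models")`.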
The solution from @zacharymostowsky worked for me. I can use MLFlowLogger from PL together with mlflow.pytorch.log_model to upload the model to the MLflow server, after setting the tracking URI and experiment ID at the beginning.
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
experiment = mlflow.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME)
exp_id = experiment.experiment_id if experiment else mlflow.create_experiment(MLFLOW_EXPERIMENT_NAME)
with mlflow.start_run(experiment_id=exp_id) as run:
    run_id = run.info.run_id
    print(f'Run ID: {run_id}')
    mlflow_logger = MLFlowLogger(experiment_name=MLFLOW_EXPERIMENT_NAME, tracking_uri=MLFLOW_TRACKING_URI)
    mlflow_logger._run_id = run_id
    ....
    mlflow.pytorch.log_model(model, "model")
Use this, where `logger_` is your `MLFlowLogger` instance:
ckpt = torch.load(ckpt_path)
model.load_state_dict(ckpt["state_dict"])
os.environ['MLFLOW_RUN_ID'] = logger_.run_id
os.environ['MLFLOW_EXPERIMENT_ID'] = logger_.experiment_id
os.environ['MLFLOW_EXPERIMENT_NAME'] = logger_._experiment_name
os.environ['MLFLOW_TRACKING_URI'] = logger_._tracking_uri
mlflow.pytorch.log_model(model, "model")
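One caveat with assigning to `os.environ` directly: it only accepts strings, so an unset attribute (for example a tracking URI of None) would raise a `TypeError`. The sketch below collects the same four variables but skips unset values and coerces the rest to strings. Both function names are hypothetical, and the private attributes `_experiment_name` and `_tracking_uri` are taken from the snippet above and may differ between Lightning versions.

```python
import os


def mlflow_env_from_logger(logger_):
    """Collect the MLflow environment variables implied by an MLFlowLogger.

    Hypothetical helper: it reads the same attributes as the snippet above,
    including the private _experiment_name and _tracking_uri, which may
    change between Lightning versions.
    """
    candidates = {
        "MLFLOW_RUN_ID": logger_.run_id,
        "MLFLOW_EXPERIMENT_ID": logger_.experiment_id,
        "MLFLOW_EXPERIMENT_NAME": logger_._experiment_name,
        "MLFLOW_TRACKING_URI": logger_._tracking_uri,
    }
    # os.environ only accepts strings, so drop unset values and coerce the rest.
    return {key: str(value) for key, value in candidates.items() if value is not None}


def apply_mlflow_env(logger_):
    """Write the collected variables into os.environ and return them."""
    env = mlflow_env_from_logger(logger_)
    os.environ.update(env)
    return env
```

Calling `apply_mlflow_env(logger_)` before `mlflow.pytorch.log_model(model, "model")` would replace the four assignments.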
❓ Questions and Help
Does anybody know a simple way to accomplish this?
Before asking:
#### What is your question?
I'm searching for a way to save model weights to the MLflow tracking server while using `MLFlowLogger` to save metrics. My problem is that I cannot find a way to save the model weights to the same run that was created inside `MLFlowLogger`. When I run `mlflow.pytorch.log_model()` after `trainer.fit()`, the metrics and the model weights are saved to different runs.
#### Code
```
mlf_logger = MLFlowLogger(
    experiment_name="WatchNetExperiment",
)
trainer = pl.Trainer(gpus=hparams.gpus,
                     distributed_backend='dp',
                     min_nb_epochs=hparams.min_epochs,
                     max_nb_epochs=hparams.max_epochs,
                     logger=mlf_logger)
trainer.fit(model)
mlflow.pytorch.log_model(model.model, "my_model")
```
#### What have you tried?
To work around the problem, I stopped using `MLFlowLogger` and modified my training code to save metrics in `train_step()` and `validation_end`.
#### What's your environment?
- OS: Linux
- Packaging: conda
- Version: 0.5.3.2