Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

MLFlowLogger used with server crashes training #12833

Open GinkoBalboa opened 2 years ago

GinkoBalboa commented 2 years ago

πŸ› Bug

Exceptions raised inside MLFlowLogger are not caught; they abort the training and the model is lost. This happens most often when logging to a server: MLFlowLogger wraps MLflow, and MLflow can log data over a server, so exceptions can occur for various reasons related to third-party server components.

To Reproduce

To reproduce the bug reliably, run an mlflow server backed by a database. I used a MySQL database, and the server is reached over local HTTP.

Prepare the MySQL database and run the server

  1. Install MySQL on your system.

     sudo apt install mysql-server

  2. Create a database to use as the backend store for the MLflow tracking server.

     CREATE DATABASE mlflow_tracking;

Under Linux, the mlflow server can be started with the following command:

user@pc:~/ModelFolder$ mlflow server --backend-store-uri mysql+pymysql://root@localhost/mlflow_tracking --default-artifact-root file:/home/user/ModelFolder/mlruns  -h 0.0.0.0 -p 5000

Info:

To run the server locally with this command, you need to disable the password for the MySQL user root@localhost.

The training script is available in the following gist (it is based on the default bug_report_model.ipynb):

bug_report_model_mlflowlogger_server.ipynb
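The relevant part of the script is roughly the following sketch (the model, dataset, and experiment name here are placeholders rather than copies from the gist); the logger simply points at the server started above:

    # Rough sketch (placeholder names): a minimal LightningModule whose
    # metrics are sent to the local MLflow server started above.
    import torch
    from torch.utils.data import DataLoader, Dataset
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.loggers import MLFlowLogger

    class RandomDataset(Dataset):
        def __init__(self, size=32, length=64):
            self.data = torch.randn(length, size)

        def __getitem__(self, index):
            return self.data[index]

        def __len__(self):
            return len(self.data)

    class BoringModel(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            loss = self.layer(batch).sum()
            self.log("train_loss", loss)  # every call goes through the logger
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    mlf_logger = MLFlowLogger(
        experiment_name="bug_report",          # placeholder experiment name
        tracking_uri="http://localhost:5000",  # the server started above
    )
    trainer = Trainer(logger=mlf_logger, max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset()))
    # If the server is down (or dies mid-run), the HTTP error raised inside
    # the logger propagates out of trainer.fit() and the run is lost.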

Expected behavior

Whenever the server stops working, the training crashes; ideally a logging failure should not abort the run.

Proposed solution

I've proposed a solution to the given problem by catching exceptions. I've tested three scenarios with the server:

  1. Server is turned off, and the script is started.
  2. Server is on, the script is started, and the server is killed during the training.
  3. Server is on, the script is started, and the server is killed after the first training finishes but before it writes to the database. For this, I used a 60-second sleep in the script.

I've investigated all three scenarios because the exception surfaces in a different function in each case, so handling only one of them is not sufficient.
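For illustration, the idea behind the patch is along these lines (a simplified sketch, not the actual diff from the upcoming PR):

    # Simplified sketch of the idea (not the actual PR diff): run the calls
    # the logger makes against the MLflow client through a small guard, so a
    # failure is reported but does not propagate into the training loop.
    import logging

    log = logging.getLogger(__name__)

    def _safe(call, *args, **kwargs):
        """Run a logger call and swallow any exception, keeping training alive."""
        try:
            return call(*args, **kwargs)
        except Exception as ex:  # deliberately broad in this sketch
            log.error("MLflow logging failed, continuing training: %s", ex)
            return None

    # Used inside the logger methods, e.g. in log_metrics:
    # _safe(self.experiment.log_metric, self.run_id, k, v, timestamp_ms, step)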

Additional context

Similar issues were raised previously:

NOTE: A PR with the proposed solution will follow shortly.

cc @borda

awaelchli commented 2 years ago

@GinkoBalboa Hello, thanks for reporting. I am not convinced by the solution you propose: it essentially ignores all exceptions and redirects them to standard output, which is not how we normally handle exceptions.

> exceptions can occur for various reasons related to third-party server components.

> Whenever the server stops working, the training crashes.

It would be good to know what exception is being raised here and catch that one specifically. It would also be good to know whether MLflow itself has any recommendation for how these connection issues should be handled (outside PL). Is this something people run into so frequently that they wrap their whole code in a catch-all try/except? I doubt that. Finally, I think it would be best if MLflow supported configuring this directly in the client, for example via an option to warn instead of error.
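To illustrate, catching only the specific, connection-related errors and warning instead of failing could look roughly like this (a sketch; the helper name is made up and the exact exception set would need to be confirmed with the MLflow devs):

    # Sketch only: warn on connection-related failures instead of crashing.
    # The helper name is made up; the exact exception set would need to be
    # confirmed with the MLflow devs.
    import warnings

    import requests
    from mlflow.exceptions import MlflowException

    def log_metric_safely(client, run_id, key, value, step=None):
        try:
            client.log_metric(run_id, key, value, step=step)
        except (MlflowException, requests.exceptions.ConnectionError) as ex:
            warnings.warn(f"Could not reach the MLflow server, metric dropped: {ex}")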

GinkoBalboa commented 2 years ago

The problem I have is that PL doesn't handle exceptions while it internally uses third-party code that throws them. In my humble opinion, the exception mechanism is meant to signal a malfunction without, where possible, breaking the whole application. I think PL, MLFlow, or any other tooling shouldn't stand in the way of model training: if the server doesn't work, that shouldn't be a reason to break a training run that can take hours or even days. I agree with you that this is not a good fix; it is not really a fix at all. But this dirty patch/override makes the difference for me between using PL and not using it. I just wanted to point that out and help the maintainers improve the code.

stackedsax commented 2 years ago

@awaelchli any thoughts on @GinkoBalboa's points? Is there a better way to address the issue @GinkoBalboa is running into? As he suggests, it would be great not to lose long training runs over something like this.

awaelchli commented 2 years ago

Hi, thanks for the replies. I agree there is a user need, and I definitely think that connection errors or server-side errors should not cancel a long-running model training run. The only thing I'm saying is that I am not sure this should be implemented in user code / PyTorch Lightning; rather, it should be configurable in the MLFlowClient itself, since the request applies to any MLflow user, not just Lightning users. Does this make sense? I would prefer to open a second issue on the MLflow GitHub to get the opinions of the MLflow devs as well :)

GinkoBalboa commented 2 years ago

> Hi, thanks for the replies. I agree there is a user need, and I definitely think that connection errors or server-side errors should not cancel a long-running model training run. The only thing I'm saying is that I am not sure this should be implemented in user code / PyTorch Lightning; rather, it should be configurable in the MLFlowClient itself, since the request applies to any MLflow user, not just Lightning users. Does this make sense? I would prefer to open a second issue on the MLflow GitHub to get the opinions of the MLflow devs as well :)

MLFlow raises exceptions; it is Lightning that does nothing about them. So as far as propagating exceptions goes, I think MLFlow already does what it needs to do. The exceptions end up in Lightning, which doesn't deal with them and instead breaks the run. That is why I made the simplest possible patch (just catch the exceptions), to point out a missing piece of functionality in Lightning.

Borda commented 1 year ago

> MLFlow raises exceptions; it is Lightning that does nothing about them. So as far as propagating exceptions goes, I think MLFlow already does what it needs to do. The exceptions end up in Lightning, which doesn't deal with them and instead breaks the run. That is why I made the simplest possible patch (just catch the exceptions), to point out a missing piece of functionality in Lightning.

I think it would be quite challenging to make the training loop listen for logger exceptions... on the one hand it could be useful, but it would also increase the complexity of a particular logger. Or do you have an idea how this could be implemented in a minimalistic way?
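For example, one minimalistic direction I could imagine (just a sketch, not an existing Lightning feature) is an opt-in flag on the logger plus a small decorator around its public logging methods:

    # Sketch of a minimalistic opt-in: a decorator applied to the public
    # logging methods, so failures only warn when the user asks for it.
    # The `ignore_logging_errors` flag is hypothetical, not an existing option.
    import functools
    import warnings

    def _fault_tolerant(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            try:
                return method(self, *args, **kwargs)
            except Exception as ex:
                if getattr(self, "ignore_logging_errors", False):
                    warnings.warn(f"{method.__name__} failed and was skipped: {ex}")
                    return None
                raise

        return wrapper

    # Inside the logger, the flag would be a constructor argument and the
    # decorator would wrap log_metrics / log_hyperparams / finalize, e.g.:
    #
    # class MLFlowLogger(Logger):
    #     @_fault_tolerant
    #     def log_metrics(self, metrics, step=None): ...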

markspec commented 5 days ago

This is by far our biggest issue with training in Lightning. Our MLflow server is remote, and whenever there is a communication issue with MLflow it takes down the training jobs. Having an option to continue training and ignore any issues with logging would greatly improve my life! The proposed PR, while sub-optimal, would fix my problem. I think having an option to ignore exceptions in the logger is reasonable.
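As a stop-gap, something like the following user-side subclass is what I have in mind (a rough, untested sketch):

    # Rough, untested sketch: subclass MLFlowLogger and turn any logging
    # failure into a warning instead of a crash.
    import warnings

    from pytorch_lightning.loggers import MLFlowLogger

    class TolerantMLFlowLogger(MLFlowLogger):
        def log_metrics(self, metrics, step=None):
            try:
                super().log_metrics(metrics, step=step)
            except Exception as ex:
                warnings.warn(f"Skipping metric logging, MLflow unreachable: {ex}")

        def log_hyperparams(self, params):
            try:
                super().log_hyperparams(params)
            except Exception as ex:
                warnings.warn(f"Skipping hyperparameter logging: {ex}")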