ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

apex FP16 causes CUDA Memory Overflow with multiple model runs (hyperparameter tuning) -- bypass with FP32 #443

Closed AlexMRuch closed 4 years ago

AlexMRuch commented 4 years ago

Describe the bug
apex appears to leak GPU memory when using FP16 training/evaluation: https://github.com/NVIDIA/apex/issues/439. This is not a problem for most single runs; however, if you are using something like optuna for hyperparameter tuning, the memory leak will trigger a CUDA Memory Overflow error within about 5-10 trials, depending on the language model.

To circumvent this issue, users can set fp16 to False and set fp16_opt_level to O0 (FP32). This avoids the memory leak and allows users to do automated hyperparameter tuning without the CUDA Memory Overflow error.

I suggest the documentation be updated to note this bug: "Some users experience CUDA Memory Overflow errors when using automated hyperparameter tuning to run multiple versions of their model back-to-back. This can be avoided by setting fp16=False and fp16_opt_level=O0 to use FP32 instead."
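
For reference, a minimal sketch of the workaround in the simpletransformers args dict (the model type and path below are just placeholders):

from simpletransformers.classification import MultiLabelClassificationModel

model_args = {
    'fp16': False,          # disable apex mixed precision entirely
    'fp16_opt_level': 'O0'  # O0 == pure FP32 (only relevant if fp16 were re-enabled)
}

# Placeholder model type/path; any supported model works the same way
model = MultiLabelClassificationModel('roberta', 'roberta-base', num_labels = 11, args = model_args)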

Hope this helps other simpletransformers users avoid this issue!

To Reproduce

"""This script trains a multilabel classification model on MFTC data
To run code:
conda activate transformers
mlflow run . -e optuna --no-conda
"""

# Import dependencies
import os
import shutil
import argparse
import gc
import json
from pathlib import Path
from pprint import pprint as pp
from urllib.parse import urlparse
import logging
import time
import datetime
import numpy as np
import pandas as pd
import re
import unidecode
from ast import literal_eval
import torch
from simpletransformers.classification import MultiLabelClassificationModel
import mlflow
import mlflow.pytorch
from mlflow import log_metric, log_param, log_artifact
from mlflow.tracking import MlflowClient
import optuna

# Create memory tracker function
def memReport():
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            print(type(obj), obj.size())

# Optimization objective pipeline
def objective(trial):
    global args
    global trial_args
    global trial_no
    global best_study_trial_no
    global best_study_eval_loss
    global best_study_eval_LRAP
    global best_trial_eval_LRAP

    ## Initialize arguments for multiLabel classification model
    model_args = {
        'reprocess_input_data': True,
        'do_lower_case': False, #Using cased model
        'use_multiprocessing': True,
        'n_gpu': args.n_gpu, #Using two GPUs causes CUDA Apex problems
        'fp16': args.fp16,
        'fp16_opt_level': args.fp16_opt_level, #https://nvidia.github.io/apex/amp.html#o1-mixed-precision-recommended-for-typical-use

        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-2),
        'weight_decay': trial.suggest_loguniform('weight_decay', 1e-10, 1e-2),
        'adam_epsilon': trial.suggest_loguniform('adam_epsilon', 1e-10, 1e-6),
        'max_grad_norm': trial.suggest_loguniform('max_grad_norm', 1.0, 5.0),
        'gradient_accumulation_steps': 1,

        'max_seq_length': 512,
        'train_batch_size': 8,
        'warmup_ratio': 0.05,
        'num_train_epochs': args.num_train_epochs,
        'save_model_every_epoch': False, #Blows up disk memory
        'no_save': False, #Save run models, but del after run if not best

        'eval_batch_size': 8,
        'evaluate_during_training': True,
        'evaluate_during_training_steps': 1000,
        'evaluate_during_training_verbose': True,
        'use_cached_eval_features': True,
        'save_eval_checkpoints': False, #Blows up disk memory
        'save_optimizer_and_scheduler': True,
        'save_steps': 1000,
        'no_cache': False,

        'use_early_stopping': True,
        'early_stopping_patience': 3,
        'early_stopping_delta': 0.01,
        'early_stopping_metric': 'eval_loss',
        'early_stopping_metric_minimize': True,

        'cache_dir': 'transformers_tmp_cache/',
        'output_dir': 'transformers_tmp_outputs/',
        'overwrite_output_dir': True,
        'best_model_dir': 'transformers_tmp_outputs/transformers_best_model_outputs/',
        'tensorboard_dir': 'transformers_tmp_outputs/runs/',
        'logging_steps': 50,
        'manual_seed': 407
    }
    print("Model args:\n", model_args)
    trial_args = model_args

    ## Create a multiLabel classification model
    print("\n********************************************************************************")
    print("********************************************************************************")
    print("Creating multiLabel classification model...")
    model = MultiLabelClassificationModel(
        'roberta', #https://huggingface.co/models
        'transformers_finetune_outputs/best_model/', #Use fine-tuned model
        num_labels = 11,
        args = model_args
    )
    print("Created multiLabel classification model")

    # Train Model
    ## Train the model
    print("********************************************************************************")
    print("Begin model training...")
    model.train_model(
        train_df = df_train_split,
        eval_df = df_test_split
    )
    print("Completed model training...")
    print("********************************************************************************")
    print("********************************************************************************\n")

    # Evaluate Model
    ## Load best model
    print("\n********************************************************************************")
    print("********************************************************************************")
    print("Loading best model from trial...")
    model = MultiLabelClassificationModel("roberta", "transformers_tmp_outputs/transformers_best_model_outputs/")
    print("Loaded best model from trial")
    ## Evaluate the model
    ## Note: https://simpletransformers.ai/docs/usage/#additional-evaluation-metrics
    print("********************************************************************************")
    print("Begin model evaluation...")
    result, model_outputs, wrong_predictions = model.eval_model(df_test_split)
    print("Completed model evaluation...")
    print("********************************************************************************")
    print("Results:", result)
    print("Example output:\n", model_outputs[:5])
    print("********************************************************************************")
    print("********************************************************************************\n")

    # Copy cache/outputs if model is best in study_name
    if result['eval_loss'] < best_study_eval_loss:
        print(f"\n***** NEW BEST STUDY ~ WOOT! ~ --> Saving best study model... *****")
        best_study_trial_no = trial_no
        best_study_eval_loss = result['eval_loss']
        best_study_eval_LRAP = result['LRAP']
        if os.path.exists("transformers_best_cache/"):
            shutil.rmtree("transformers_best_cache/")
        if os.path.exists("transformers_best_outputs/"):
            shutil.rmtree("transformers_best_outputs/")
        shutil.copytree('transformers_tmp_cache/', 'transformers_best_cache/')
        shutil.copytree('transformers_tmp_outputs/', 'transformers_best_outputs/')
        print("  --> Saved best model/cache to transformers_best_outputs/ and transformers_best_cache/")

    # Finish training: cleanup GPU
    print("\nFinished training: removing cache/outputs and clearing GPU...\n")
    del model
    del model_outputs
    del wrong_predictions
    del model_args
    shutil.rmtree('transformers_tmp_cache/')
    shutil.rmtree('transformers_tmp_outputs/')
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

    # Return results (eval_loss vs LRAP)
    best_trial_eval_LRAP = result['LRAP']
    return result['eval_loss']

if __name__ == '__main__':
    # Start script
    ## Get date at runtime
    runtime_data = datetime.date.today()

    ## Set study name
    studyname = "mftclassifier"

    # argparse has dumb bool logic, so make a str2bool fcn: https://bugs.python.org/issue26994
    def str2bool(v):
        # https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse
        if isinstance(v, bool):
            return v
        if v.lower() in ('yes', 'true', 't', 'y', '1'):
            return True
        elif v.lower() in ('no', 'false', 'f', 'n', '0'):
            return False
        else:
            raise argparse.ArgumentTypeError('Boolean value expected.')

    # Initialize sys.argv model pipeline arguments/parameters
    print("\nInitializing sys.argv model pipeline arguments/parameters...")
    parser = argparse.ArgumentParser(description='transformer')
    parser.add_argument("--fp16", type=str2bool, default=True,
        help="Use 16 point floating precision, default: True [False == fp32]")
    parser.add_argument("--fp16_opt_level", type=str, default="O1",
        help="Apex optimization level for 16 floating point precision, default: O1 [O0 == fp32]")
    parser.add_argument("-ne", "--num_train_epochs", type=int, default=100,
        help="number of training epochs, default: 100")
    parser.add_argument("--n_gpu", type=int, default=1,
        help="number of gpus [set to 1 or 2], default: 1 [gpu 0], cpu: 0, 2 == Apex errors")
    parser.add_argument("--max_runs", type=int, default=100,
        help="max number of hyperparameter tuning trials to run, default: 100")
    args = parser.parse_args()
    print(args)

    # Helper function to run/track hyperparameter tuning trials
    exp_id = None
    def mlflow_callback(study, trial):
        global args
        global exp_id
        global trial_args
        global trial_no
        global log_data_param_dict
        global best_trial_eval_LRAP
        with mlflow.start_run(run_name=study.study_name+str(trial_no), nested=True) as run:
            print("MLflow tracking URI: ", mlflow.get_tracking_uri())
            exp_id = mlflow.active_run().info.experiment_id
            print("MLflow experiment id:", exp_id)
            run_id = mlflow.active_run().info.run_id
            print("MLflow run number:   ", run_id)
            ## Log study goals and data parameters
            mlflow.set_tag(
                "Goal_brief",
                "Hyperparameter tuning for multilabel classification of moral sentiments with DistilRoBERTa"
            )
            mlflow.set_tag(
                "Goal_full",
                """Run hyperparameter tuning (with Optuna) for a DistilRoBERTa transformer model
                for multilabel classification of moral sentiments from the MFTC dataset. Training
                and validation data are stored at /media/seagate0/amazon/data/MFTC_*_split.csv."""
            )
            mlflow.log_params(log_data_param_dict)
            mlflow.log_params(trial_args)
            trial_value = trial.value if trial.value is not None else float("nan")
            mlflow.log_metric("Best_trial_val_loss", trial_value)
            mlflow.log_metric("Best_trial_val_LRAP", best_trial_eval_LRAP)
            trial_no += 1

    # Setup Logging/Tracking/Seeds/GPU
    ## Setup logging
    logging.basicConfig(level=logging.INFO)
    transformers_logger = logging.getLogger(studyname)
    transformers_logger.setLevel(logging.WARNING)
    ## Setup seeds
    random_seed = 407
    np.random.seed(random_seed)
    torch.manual_seed(random_seed)
    ## Setup GPU
    use_gpu = torch.cuda.is_available()
    print("Setup Logging/Tracking/Seeds/GPU info:")
    print("  random_seed:", random_seed)
    print("  use_gpu:    ", use_gpu)

    # Load Cleaned, Preprocessed, Split Data
    ## Initialize data dictionary for MLflow tracking
    log_data_param_dict = {}
    ## Set datasource
    datasource = '/media/seagate0/amazon/'
    ## Load data
    print("\nLoading processed training and validation data from server...")
    df_train_split = pd.read_csv(datasource+"data/MFTC_train_split.csv", encoding='utf-8', engine='python')
    df_train_split["labels"] = df_train_split["labels"].apply(literal_eval) #pd .csv converts list to str --> undo
    print("\ndf_train_split.shape:", df_train_split.shape)
    print(df_train_split.head())
    log_data_param_dict["df_train_split_datasource"] = datasource+"data/MFTC_train_split.csv"
    log_data_param_dict["df_train_split_shape"] = df_train_split.shape
    df_test_split = pd.read_csv(datasource+"data/MFTC_test_split.csv", encoding='utf-8', engine='python')
    df_test_split["labels"] = df_test_split["labels"].apply(literal_eval)
    print("\ndf_test_split.shape:", df_test_split.shape)
    print(df_test_split.head())
    log_data_param_dict["df_test_split_datasource"] = datasource+"data/MFTC_test_split.csv"
    log_data_param_dict["df_test_split_shape"] = df_test_split.shape
    ## Setup other helpful variables
    print("\nLoading labels...")
    labels = [
        "care",
        "harm",
        "fairness",
        "cheating",
        "loyalty",
        "betrayal",
        "authority",
        "subversion",
        "purity",
        "degradation",
        "non-moral"
    ]
    print(f"  Labels (n = {len(labels)}): {labels}")
    log_data_param_dict["labels"] = labels

    # Run optimization model pipeline
    print("\nRunning optimization model pipeline...")
    best_study_eval_loss = 99**99
    best_study_eval_LRAP = 0
    best_trial_eval_LRAP = 0
    best_study_trial_no = ""
    trial_args = {}
    trial_no = 0
    study = optuna.create_study(
        study_name = studyname,
        direction = "minimize",
        sampler = optuna.samplers.TPESampler(seed = random_seed)
    ) # no pruner because intermediate results not captured
    study.optimize(
        objective,
        n_trials = args.max_runs,
        n_jobs = 1,
        callbacks = [mlflow_callback],
        catch = (RuntimeError,) #pass CUDA Memory Overflow
    )
    print("Finished running optimization model pipeline")

    # Evaluate optimization trials
    study_stats_dict = {}
    print("Study statistics:")
    print("  Number of finished trials:", len(study.trials))
    study_stats_dict["Number_finished_trials"] = len(study.trials)
    trials_completed = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]
    print("  Number of complete trials:", len(trials_completed))
    study_stats_dict["Number_complete_trials"] = len(trials_completed)
    try:
        trials_pruned = [t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED]
        print("  Number of pruned trials:  ", len(trials_pruned))
        study_stats_dict["Number_pruned_trials"] = len(trials_pruned)
    except:
        print("  Number of pruned trials:  0")
        study_stats_dict["Number_pruned_trials"] = 0
    try:
        trials_failed = [t for t in study.trials if t.state == optuna.trial.TrialState.FAILED]
        print("  Number of failed trials:  ", len(trials_failed))
        study_stats_dict["Number_failed_trials"] = len(trials_failed)
    except:
        print("  Number of failed trials:  0")
        study_stats_dict["Number_failed_trials"] = 0
    best_trial = study.best_trial
    print("  Best trial:", best_study_trial_no)
    study_stats_dict["Best_trial_no"] = best_study_trial_no
    print("    Best validation loss:", best_trial.value)
    study_stats_dict["Best_validation_loss"] = best_trial.value
    print("    Best validation LRAP:", best_trial.value)
    study_stats_dict["Best_validation_LRAP"] = best_trial.value
    print(f"    Best parameters (saved at mlruns/{exp_id}/best_params.json):")
    for param,value in best_trial.params.items():
        print("      Best {}:\t{}".format(param, value))
    with open(f'mlruns/{exp_id}/best_params.json', 'w') as f:
        json.dump(best_trial.params, f)
    with open(f'mlruns/{exp_id}/study_stats.json', 'w') as f:
        json.dump(study_stats_dict, f)

    # Save study visualizations
    if not os.path.exists(f'mlruns/{exp_id}/opt_figures'):
        os.makedirs(f'mlruns/{exp_id}/opt_figures')
    fig = optuna.visualization.plot_optimization_history(study)
    fig.write_image(f"mlruns/{exp_id}/opt_figures/study_optimization_history_{runtime_data}.png")
    fig = optuna.visualization.plot_parallel_coordinate(study)
    fig.write_image(f"mlruns/{exp_id}/opt_figures/study_parallel_coordinate_{runtime_data}.png")

    # Cleanup if cleanup fails
    if os.path.exists("transformers_tmp_cache/"):
        shutil.rmtree("transformers_tmp_cache/")
    if os.path.exists("transformers_tmp_outputs/"):
        shutil.rmtree("transformers_tmp_outputs/")

    print("\n\n********************************************************************************")
    print("********************************************************************************\n\n")
    print("Script complete")
    print(f"  Study stats saved at mlruns/{exp_id}/study_stats.json")
    print(f"  Best model parameters are saved at mlruns/{exp_id}/best_params.pickle")
    print(f"  Best model trial is saved at mlruns/{exp_id}/{studyname}{best_study_trial_no}")
    print("  Best model/cache saved at transformers_best_outputs/ and transformers_best_cache/")
    print("\n\nRun `mlflow ui --port 5001` to view model log, parameters, metrics, artifacts\n\n")

Expected behavior
Automated hyperparameter tuning should run for the pre-set number of trials without throwing a CUDA Memory Error unless that error is due to the model/data genuinely being too large to fit on the GPU(s).


Additional context
This error was raised in an environment specific to simpletransformers.

ThilinaRajapakse commented 4 years ago

Wow! I've been tearing my hair out over this for quite some time now. Couldn't figure out why the CUDA memory wasn't being cleared despite trying every trick in the book! Thank you for pointing out the cause; it definitely helps!

I've been using bash scripts to run the python scripts when I need to do multiple runs and/or hyperparameter tuning (I use W&B sweeps, and I'm thinking of adding a class to the library to support this natively as well). Maybe this will help you too, because you can avoid sacrificing the mixed-precision benefits if you use bash to initiate the python scripts.
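
Roughly, the idea is that each run lives in its own process, so the OS releases all GPU memory (including whatever apex leaks) when that process exits. A Python stand-in for the bash driver might look like this (the train.py script name and its flags are hypothetical):

    import itertools
    import subprocess

    # Hypothetical grid of hyperparameters to sweep over
    learning_rates = [1e-5, 3e-5, 5e-5]
    batch_sizes = [8, 16]

    for lr, bs in itertools.product(learning_rates, batch_sizes):
        # Each call starts a fresh Python process, so CUDA memory is fully
        # released when the run finishes, regardless of the apex leak
        subprocess.run(
            ["python", "train.py", "--learning_rate", str(lr), "--train_batch_size", str(bs)],
            check = True,
        )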

AlexMRuch commented 4 years ago

So glad you think my comments are helpful! I've hit this issue with other PyTorch-related libraries, like dgl (Deep Graph Library), so I was beginning to think it was an issue specific to optuna or mlflow. Happy to have finally narrowed it down (I think). Had I not figured that out, my next step was definitely going to be using bash scripts.

I may still consider using them, given that FP16 does increase speed a bit. Thanks for letting me know that worked for you!

How do you get the hyperparameter "study" to remember each "trial" suggestion of hyperparameters across runs when you use the bash approach? With optuna, you do a setup like this:

    study = optuna.create_study(
        study_name = studyname,
        direction = "minimize",
        sampler = optuna.samplers.TPESampler(seed = random_seed)
    ) # no pruner because intermediate results not captured
    study.optimize(
        objective,
        n_trials = args.max_runs,
        n_jobs = 1,
        callbacks = [mlflow_callback],
        catch = (RuntimeError,) #pass CUDA Memory Overflow
    )

With a bash approach, to clear out CUDA, I imagine you'd have to let the whole Python process shut down. Do you use subprocess, with one Python runtime calling another, and then send inputs / grab outputs between the processes?
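
For example, I imagine it could look something like the sketch below, where optuna's sqlite storage keeps the study state across processes (the file names and module are made up, and objective() is the one from the script above):

    # run_one_trial.py (hypothetical): load or create the shared study and run one trial
    import optuna
    from train_pipeline import objective  # hypothetical module holding objective() from above

    study = optuna.create_study(
        study_name = "mftclassifier",
        direction = "minimize",
        sampler = optuna.samplers.TPESampler(seed = 407),
        storage = "sqlite:///mftclassifier.db",  # persists trial history across processes
        load_if_exists = True,
    )
    study.optimize(objective, n_trials = 1)

    # driver.py (hypothetical): one fresh Python process per trial so CUDA memory is released
    import subprocess
    for _ in range(100):
        subprocess.run(["python", "run_one_trial.py"], check = True)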

Thanks again for this awesome library! Feel free to close this thread if you wish!

ThilinaRajapakse commented 4 years ago

I set up my python script to accept command line arguments for the parameters that I want to "sweep". Then I just write a bash script that sequentially calls the python script with the necessary arguments. It's not exactly elegant though. 😅

I can give an example script later if you need.

ThilinaRajapakse commented 4 years ago

Update: it looks like none of this is necessary with wandb sweeps, which clear out the GPU memory between each run.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.