Resume from checkpoint - GitHub issues

I have an RTX 3090 (24 GB), 64 GB of RAM, and 50 GB of swap. Training itself works fine, but resuming training from a checkpoint runs out of memory (OOM). Here is the log:
[2021-05-07 19:18:39,962] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-05-07 19:18:39,973] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config_gptneo_new.json --model_name_or_path /datadrive/model/checkpoint-800/ --train_file merged_train.txt.csv --do_train --fp16 --overwrite_cache --output_dir /datadrive/model --num_train_epochs 1 --gradient_accumulation_steps 2 --per_device_train_batch_size 4 --use_fast_tokenizer False --learning_rate 5e-06 --save_steps 400
[2021-05-07 19:18:40,526] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2021-05-07 19:18:40,526] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-05-07 19:18:40,526] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-05-07 19:18:40,526] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-05-07 19:18:40,526] [INFO] [launch.py:102:main] dist_world_size=1
[2021-05-07 19:18:40,526] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-05-07 19:18:41,601] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
05/07/2021 19:18:41 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
05/07/2021 19:18:41 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/datadrive/model, overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=8, gradient_accumulation_steps=2, eval_accumulation_steps=None, learning_rate=5e-06, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/May07_19-18-41_9c3c6cac903e, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=400, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/datadrive/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=ds_config_gptneo_new.json, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, _n_gpu=1, mp_parameters=)
05/07/2021 19:18:42 - WARNING - datasets.builder -   Using custom data configuration default-b5898a6a80220f13
05/07/2021 19:18:42 - WARNING - datasets.builder -   Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b5898a6a80220f13/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
[INFO|configuration_utils.py:515] 2021-05-07 19:18:42,390 >> loading configuration file /datadrive/model/checkpoint-800/config.json
[INFO|configuration_utils.py:553] 2021-05-07 19:18:42,390 >> Model config GPTNeoConfig {
  "_name_or_path": "EleutherAI/gpt-neo-2.7B",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      16
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": true,
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 20,
  "num_layers": 32,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50,
      "temperature": 0.9
    }
  },
  "tokenizer_class": "GPT2Tokenizer",
  "transformers_version": "4.6.0.dev0",
  "use_cache": false,
  "vocab_size": 50257,
  "window_size": 256
}

[INFO|configuration_utils.py:517] 2021-05-07 19:18:42,765 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /models/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:553] 2021-05-07 19:18:42,765 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.6.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /models/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /models/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /models/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1147] 2021-05-07 19:18:44,955 >> loading weights file /datadrive/model/checkpoint-800/pytorch_model.bin
[INFO|modeling_utils.py:1328] 2021-05-07 19:18:59,255 >> All model checkpoint weights were used when initializing GPTNeoForCausalLM.

[INFO|modeling_utils.py:1336] 2021-05-07 19:18:59,255 >> All the weights of GPTNeoForCausalLM were initialized from the model checkpoint at /datadrive/model/checkpoint-800/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoForCausalLM for predictions without further training.
  0%|                                                     | 0/1 [00:00<?, ?ba/s][WARNING|tokenization_utils_base.py:3170] 2021-05-07 19:19:40,807 >> Token indices sequence length is longer than the specified maximum sequence length for this model (14397149 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████████████████████████████████████████| 1/1 [00:42<00:00, 42.00s/ba]
100%|█████████████████████████████████████████████| 1/1 [00:08<00:00,  8.47s/ba]
[INFO|trainer.py:414] 2021-05-07 19:19:50,812 >> Using amp fp16 backend
[INFO|trainer.py:1042] 2021-05-07 19:19:50,865 >> Loading model from /datadrive/model/checkpoint-800/).
[2021-05-07 19:19:50,867] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.16, git-hash=unknown, git-branch=unknown
[2021-05-07 19:19:50,867] [WARNING] [config.py:79:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-05-07 19:19:54,135] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.1879847049713135 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-05-07 19:19:58,240] [INFO] [engine.py:610:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-05-07 19:19:58,240] [INFO] [engine.py:615:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-05-07 19:19:58,240] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-05-07 19:19:58,240] [INFO] [stage2.py:102:__init__] Reduce bucket size 200000000.0
[2021-05-07 19:19:58,240] [INFO] [stage2.py:103:__init__] Allgather bucket size 200000000.0
[2021-05-07 19:19:58,240] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.4445114135742188 seconds
[2021-05-07 19:21:35,500] [INFO] [stage2.py:381:__init__] optimizer state initialized
[2021-05-07 19:21:35,709] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2021-05-07 19:21:35,760] [INFO] [engine.py:439:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-05-07 19:21:35,761] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fe9d20fb5b0>
[2021-05-07 19:21:35,769] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[[0.9, 0.999]]
[2021-05-07 19:21:35,777] [INFO] [config.py:747:print] DeepSpeedEngine configuration:
[2021-05-07 19:21:35,925] [INFO] [config.py:751:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   allreduce_always_fp32 ........ False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_enabled .................. False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_params ................... False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   checkpoint_tag_validation_enabled  True
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   checkpoint_tag_validation_fail  False
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   disable_allgather ............ False
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   dump_state ................... False
[2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   elasticity_enabled ........... False
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 3, 
    "detailed": true
}
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   fp16_enabled ................. True
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   global_rank .................. 0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 2
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_clipping ............ 1.0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 65536
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   loss_scale ................... 0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   memory_breakdown ............. False
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
[2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_name ............... adamw
[2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_params ............. {'lr': 5e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_enabled .................. False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_params ................... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   prescale_gradients ........... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_name ............... WarmupLR
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-06, 'warmup_num_steps': 0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_attention ............. None
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   steps_per_print .............. 2000
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_output_path ...... 
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_batch_size ............. 8
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  4
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   wall_clock_breakdown ......... False
[2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   world_size ................... 1
[2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  False
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_config .................. {
    "stage": 2, 
    "contiguous_gradients": true, 
    "reduce_scatter": true, 
    "reduce_bucket_size": 2.000000e+08, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 2.000000e+08, 
    "overlap_comm": true, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": true, 
    "offload_param": null, 
    "offload_optimizer": {
        "device": "cpu", 
        "nvme_path": null, 
        "buffer_count": 4, 
        "pin_memory": false, 
        "pipeline_read": false, 
        "pipeline_write": false, 
        "fast_init": false
    }, 
    "sub_group_size": 1.000000e+12, 
    "prefetch_bucket_size": 5.000000e+07, 
    "param_persistence_threshold": 1.000000e+05, 
    "max_live_parameters": 1.000000e+09, 
    "max_reuse_distance": 1.000000e+09, 
    "gather_fp16_weights_on_model_save": false, 
    "find_unused_parameters": false
}
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_enabled ................. True
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_optimization_stage ...... 2
[2021-05-07 19:21:35,942] [INFO] [config.py:753:print]   json = {
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 5e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.0
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 5e-06, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true, 
        "cpu_offload": true
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 8, 
    "train_micro_batch_size_per_gpu": 4, 
    "wall_clock_breakdown": false
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.09232521057128906 seconds
[INFO|integrations.py:536] 2021-05-07 19:21:36,160 >> Attempting to resume from /datadrive/model/checkpoint-800/
[2021-05-07 19:21:36,175] [INFO] [engine.py:1480:_load_checkpoint] rank: 0 loading checkpoint: /datadrive/model/checkpoint-800/global_step800/mp_rank_00_model_states.pt
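
For a sense of scale, here is a back-of-the-envelope estimate of how much memory the model and optimizer state need. It is a rough sketch, not a measurement. The architecture numbers are taken from the GPTNeoConfig printed in the log above; the per-block parameter layout in `count_parameters` and the 12 bytes per parameter figure (fp32 master weights plus the two Adam moments) are assumptions based on the usual mixed-precision Adam accounting, and the estimate ignores gradients, activations, and any temporary buffers used while reading the checkpoint.

```python
# Rough memory estimate for the model in the log above (a sketch, not a measurement).
# Architecture values are copied from the GPTNeoConfig printed in the log.
hidden_size = 2560
num_layers = 32
vocab_size = 50257
max_position_embeddings = 2048


def count_parameters(h: int, layers: int, vocab: int, positions: int) -> int:
    """Count parameters, ASSUMING a GPT-Neo-style block layout with tied output embeddings."""
    embeddings = vocab * h + positions * h  # token + position embeddings
    attention = 4 * h * h + h               # q, k, v without bias; output projection with bias
    mlp = 8 * h * h + 5 * h                 # two linear layers with a 4h inner width, with biases
    layer_norms = 4 * h                     # two LayerNorms per block (weight + bias each)
    final_layer_norm = 2 * h
    return embeddings + layers * (attention + mlp + layer_norms) + final_layer_norm


def gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


n_params = count_parameters(hidden_size, num_layers, vocab_size, max_position_embeddings)

# ASSUMPTION: 2 bytes per parameter for the fp16 weights kept on the GPU.
fp16_weights_bytes = 2 * n_params
# ASSUMPTION: 4 (fp32 master copy) + 4 (Adam momentum) + 4 (Adam variance) bytes per
# parameter for the optimizer state, which lives in host RAM when CPU offload is on.
optimizer_state_bytes = 12 * n_params

print(f"parameters:            {n_params:,}")
print(f"fp16 weights (GPU):    {gib(fp16_weights_bytes):.1f} GiB")
print(f"optimizer state (CPU): {gib(optimizer_state_bytes):.1f} GiB")
```

Under these assumptions the offloaded optimizer state alone is close to 30 GiB. If resuming holds a second copy of that state in host memory while the checkpoint file is being read, 64 GB of RAM would be very tight, but that is a guess on my part and the log does not confirm it.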
Xirider / finetune-gpt2xl

Resume from checkpoint #9