Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and finetune GPT-NEO (2.7 B) on a single GPU with Huggingface Transformers using DeepSpeed
MIT License

Resume from checkpoint #9

Closed ArturTanona closed 3 years ago

ArturTanona commented 3 years ago

I have an RTX 3090 (24 GB), 64 GB of RAM, and 50 GB of swap. Training works nicely, but resuming training from a checkpoint results in an OOM error:

[2021-05-07 19:18:39,962] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-05-07 19:18:39,973] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config_gptneo_new.json --model_name_or_path /datadrive/model/checkpoint-800/ --train_file merged_train.txt.csv --do_train --fp16 --overwrite_cache --output_dir /datadrive/model --num_train_epochs 1 --gradient_accumulation_steps 2 --per_device_train_batch_size 4 --use_fast_tokenizer False --learning_rate 5e-06 --save_steps 400
[2021-05-07 19:18:40,526] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2021-05-07 19:18:40,526] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-05-07 19:18:40,526] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-05-07 19:18:40,526] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-05-07 19:18:40,526] [INFO] [launch.py:102:main] dist_world_size=1
[2021-05-07 19:18:40,526] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-05-07 19:18:41,601] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
05/07/2021 19:18:41 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
05/07/2021 19:18:41 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/datadrive/model, overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=4, per_device_eval_batch_size=8, gradient_accumulation_steps=2, eval_accumulation_steps=None, learning_rate=5e-06, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/May07_19-18-41_9c3c6cac903e, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=400, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/datadrive/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=ds_config_gptneo_new.json, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, _n_gpu=1, mp_parameters=)
05/07/2021 19:18:42 - WARNING - datasets.builder -   Using custom data configuration default-b5898a6a80220f13
05/07/2021 19:18:42 - WARNING - datasets.builder -   Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b5898a6a80220f13/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0)
[INFO|configuration_utils.py:515] 2021-05-07 19:18:42,390 >> loading configuration file /datadrive/model/checkpoint-800/config.json
[INFO|configuration_utils.py:553] 2021-05-07 19:18:42,390 >> Model config GPTNeoConfig {
  "_name_or_path": "EleutherAI/gpt-neo-2.7B",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      16
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": true,
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 20,
  "num_layers": 32,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50,
      "temperature": 0.9
    }
  },
  "tokenizer_class": "GPT2Tokenizer",
  "transformers_version": "4.6.0.dev0",
  "use_cache": false,
  "vocab_size": 50257,
  "window_size": 256
}

[INFO|configuration_utils.py:517] 2021-05-07 19:18:42,765 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /models/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:553] 2021-05-07 19:18:42,765 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.6.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at /models/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at /models/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-05-07 19:18:44,877 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at /models/transformers/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1147] 2021-05-07 19:18:44,955 >> loading weights file /datadrive/model/checkpoint-800/pytorch_model.bin
[INFO|modeling_utils.py:1328] 2021-05-07 19:18:59,255 >> All model checkpoint weights were used when initializing GPTNeoForCausalLM.

[INFO|modeling_utils.py:1336] 2021-05-07 19:18:59,255 >> All the weights of GPTNeoForCausalLM were initialized from the model checkpoint at /datadrive/model/checkpoint-800/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoForCausalLM for predictions without further training.
  0%|                                                     | 0/1 [00:00<?, ?ba/s][WARNING|tokenization_utils_base.py:3170] 2021-05-07 19:19:40,807 >> Token indices sequence length is longer than the specified maximum sequence length for this model (14397149 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████████████████████████████████████████| 1/1 [00:42<00:00, 42.00s/ba]
100%|█████████████████████████████████████████████| 1/1 [00:08<00:00,  8.47s/ba]
[INFO|trainer.py:414] 2021-05-07 19:19:50,812 >> Using amp fp16 backend
[INFO|trainer.py:1042] 2021-05-07 19:19:50,865 >> Loading model from /datadrive/model/checkpoint-800/).
[2021-05-07 19:19:50,867] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.16, git-hash=unknown, git-branch=unknown
[2021-05-07 19:19:50,867] [WARNING] [config.py:79:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-05-07 19:19:54,135] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.1879847049713135 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-05-07 19:19:58,240] [INFO] [engine.py:610:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-05-07 19:19:58,240] [INFO] [engine.py:615:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-05-07 19:19:58,240] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-05-07 19:19:58,240] [INFO] [stage2.py:102:__init__] Reduce bucket size 200000000.0
[2021-05-07 19:19:58,240] [INFO] [stage2.py:103:__init__] Allgather bucket size 200000000.0
[2021-05-07 19:19:58,240] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.4445114135742188 seconds
[2021-05-07 19:21:35,500] [INFO] [stage2.py:381:__init__] optimizer state initialized
[2021-05-07 19:21:35,709] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2021-05-07 19:21:35,760] [INFO] [engine.py:439:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-05-07 19:21:35,761] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fe9d20fb5b0>
[2021-05-07 19:21:35,769] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[[0.9, 0.999]]
[2021-05-07 19:21:35,777] [INFO] [config.py:747:print] DeepSpeedEngine configuration:
[2021-05-07 19:21:35,925] [INFO] [config.py:751:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-07 19:21:35,926] [INFO] [config.py:751:print]   allreduce_always_fp32 ........ False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_enabled .................. False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   amp_params ................... False
[2021-05-07 19:21:35,927] [INFO] [config.py:751:print]   checkpoint_tag_validation_enabled  True
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   checkpoint_tag_validation_fail  False
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   disable_allgather ............ False
[2021-05-07 19:21:35,928] [INFO] [config.py:751:print]   dump_state ................... False
[2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-05-07 19:21:35,929] [INFO] [config.py:751:print]   elasticity_enabled ........... False
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 3, 
    "detailed": true
}
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   fp16_enabled ................. True
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   global_rank .................. 0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 2
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_clipping ............ 1.0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 65536
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   loss_scale ................... 0
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   memory_breakdown ............. False
[2021-05-07 19:21:35,931] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
[2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_name ............... adamw
[2021-05-07 19:21:35,932] [INFO] [config.py:751:print]   optimizer_params ............. {'lr': 5e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_enabled .................. False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   pld_params ................... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   prescale_gradients ........... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_name ............... WarmupLR
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-06, 'warmup_num_steps': 0}
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_attention ............. None
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   steps_per_print .............. 2000
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   tensorboard_output_path ...... 
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_batch_size ............. 8
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  4
[2021-05-07 19:21:35,933] [INFO] [config.py:751:print]   wall_clock_breakdown ......... False
[2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   world_size ................... 1
[2021-05-07 19:21:35,934] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  False
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_config .................. {
    "stage": 2, 
    "contiguous_gradients": true, 
    "reduce_scatter": true, 
    "reduce_bucket_size": 2.000000e+08, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 2.000000e+08, 
    "overlap_comm": true, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": true, 
    "offload_param": null, 
    "offload_optimizer": {
        "device": "cpu", 
        "nvme_path": null, 
        "buffer_count": 4, 
        "pin_memory": false, 
        "pipeline_read": false, 
        "pipeline_write": false, 
        "fast_init": false
    }, 
    "sub_group_size": 1.000000e+12, 
    "prefetch_bucket_size": 5.000000e+07, 
    "param_persistence_threshold": 1.000000e+05, 
    "max_live_parameters": 1.000000e+09, 
    "max_reuse_distance": 1.000000e+09, 
    "gather_fp16_weights_on_model_save": false, 
    "find_unused_parameters": false
}
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_enabled ................. True
[2021-05-07 19:21:35,938] [INFO] [config.py:751:print]   zero_optimization_stage ...... 2
[2021-05-07 19:21:35,942] [INFO] [config.py:753:print]   json = {
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 5e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.0
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 5e-06, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true, 
        "cpu_offload": true
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 8, 
    "train_micro_batch_size_per_gpu": 4, 
    "wall_clock_breakdown": false
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.09232521057128906 seconds
[INFO|integrations.py:536] 2021-05-07 19:21:36,160 >> Attempting to resume from /datadrive/model/checkpoint-800/
[2021-05-07 19:21:36,175] [INFO] [engine.py:1480:_load_checkpoint] rank: 0 loading checkpoint: /datadrive/model/checkpoint-800/global_step800/mp_rank_00_model_states.pt

Xirider commented 3 years ago

You are loading from the DeepSpeed checkpoints (mp_rank...). I am not sure whether they work with Hugging Face Transformers yet, and they are also quite large: tens of GB. I would recommend that you delete the global_step folder (/datadrive/model/checkpoint-800/global_step800/ in your case) and just start again from the model in the checkpoint folder (/datadrive/model/checkpoint-800/). This way DeepSpeed won't try to resume from the DeepSpeed checkpoint.
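A minimal sketch of that cleanup step, assuming the DeepSpeed state sits in `global_step*` subfolders of the checkpoint directory, as in the log above. The helper name `remove_deepspeed_state` is hypothetical and not part of this repo, Transformers, or DeepSpeed.

```python
import shutil
from pathlib import Path


def remove_deepspeed_state(checkpoint_dir):
    """Delete global_step* folders inside a checkpoint directory.

    Hypothetical helper. Only directories whose name starts with
    "global_step" are removed; files such as pytorch_model.bin and
    config.json are left in place. Returns the removed paths.
    """
    removed = []
    for sub in sorted(Path(checkpoint_dir).glob("global_step*")):
        if sub.is_dir():
            shutil.rmtree(sub)
            removed.append(sub)
    return removed


# Example call (path taken from the log above; adjust to your setup):
# remove_deepspeed_state("/datadrive/model/checkpoint-800/")
```

The deletion is permanent, so it may be worth copying the folder elsewhere first if you think you might need the optimizer state later.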

In general, if there are memory issues, you can always try reducing the batch size (and in turn increasing gradient_accumulation_steps), and you can reduce allgather_bucket_size and reduce_bucket_size to 5e7 in the ds_config_gptneo_new.json file.
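The bucket-size change can be sketched as a small script, assuming the config keeps both keys under `zero_optimization` as in the JSON printed in the log. The function names `shrink_buckets` and `shrink_buckets_in_file` are hypothetical; editing the file by hand works just as well.

```python
import copy
import json


def shrink_buckets(config, bucket_size=5e7):
    """Return a copy of a DeepSpeed config dict with smaller ZeRO buckets.

    Hypothetical helper: sets allgather_bucket_size and reduce_bucket_size
    under "zero_optimization" and leaves every other key unchanged.
    """
    updated = copy.deepcopy(config)
    zero = updated.setdefault("zero_optimization", {})
    zero["allgather_bucket_size"] = bucket_size
    zero["reduce_bucket_size"] = bucket_size
    return updated


def shrink_buckets_in_file(path, bucket_size=5e7):
    """Rewrite a DeepSpeed JSON config file in place with smaller buckets."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(shrink_buckets(config, bucket_size), f, indent=4)


# Example call (file name taken from the command line in the log):
# shrink_buckets_in_file("ds_config_gptneo_new.json")
```

For the batch-size side, the run in the log uses a per-device batch size of 4 with 2 accumulation steps (`train_batch_size` 8). Passing `--per_device_train_batch_size 2 --gradient_accumulation_steps 4` would keep that product at 8 while halving the micro-batch.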

ArturTanona commented 3 years ago

Worked! Many thanks! And thanks for this awesome repo!