I created a new Python 3.8.0 virtual environment, installed torch 1.8.2+cu111 and the latest NVIDIA Apex, then installed all the requirements from each file in the requirements directory of GPT-NeoX.
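For what it's worth, the requirements step boiled down to roughly this (just a sketch; it assumes the files all match requirements/requirements*.txt, and torch/Apex were installed separately per their own instructions):

for f in requirements/requirements*.txt; do pip install -r "$f"; done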
I notice that you do not explicitly mention DeepSpeed or DeeperSpeed. Since GPT-NeoX relies on DeeperSpeed for ZeRO and the rest of its distributed-training features, it is an important piece of the puzzle. DeeperSpeed should be installed from ./requirements/requirements.txt, so I would verify that that install completed correctly, and also test ZeRO 2 on some other code so we can be sure this is a GPT-NeoX problem rather than a DeeperSpeed problem.
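A quick way to confirm which DeepSpeed build actually got picked up, assuming a standard pip install:

python -c "import deepspeed; print(deepspeed.__version__)"

With the DeeperSpeed fork pinned in requirements.txt this should report a 0.3.15-era version, matching the version=0.3.15+eb7f5cf line that shows up in the logs further down.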
Starting on a host with NVIDIA driver 460.91.03, I ran a new Docker container from the image nvidia/cuda:11.1.1-devel-ubuntu18.04.
I would strongly suggest using our prebuilt images or building your own image from the included Dockerfile: all of the critical dependencies will be preinstalled and ready to use. It is much easier to use our verified environment than to install the dependencies from scratch yourself.
Thanks for the reply! I installed directly from the requirements.txt files, so DeeperSpeed was installed. I've trained using ZeRO 1 without issue; the error only occurs once I've made the following two config changes to small.yml:
"stage": 2
"cpu_offload": True
I gave the docker container referenced in the README a try and got the same issue:
git clone https://github.com/EleutherAI/gpt-neox.git
cd gpt-neox
nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=/gpt-neox leogao2/gpt-neox:sha-63e10af
cd /gpt-neox
python prepare_data.py enron
# Confirm default config works
python deepy.py train.py -d configs/ small.yml local_setup.yml
# Got error saying to run the below command (sudo -s added by me to correct permission denied errors)
sudo -s
python /gpt-neox/megatron/fused_kernels/setup.py install
# Rerun the original command
python deepy.py train.py -d configs/ small.yml local_setup.yml
# Training ran without issue. CTRL+Ced out after 60 steps
# Enable ZeRO 2 cpu_offload
cp configs/small.yml configs/small_offload.yml
sed -i 's/"stage": 0/"stage": 2/g' configs/small_offload.yml
sed -i 's/"cpu_offload": False/"cpu_offload": True/g' configs/small_offload.yml
# Run with CPU offload
python deepy.py train.py -d configs/ small_offload.yml local_setup.yml
# The above command throws "RuntimeError: expected input to be on cuda"
NeoXArgs.from_ymls() ['configs/small_offload.yml', 'configs/local_setup.yml'] INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4 -------------------- arguments -------------------- attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated attention_dropout ............... 0.0.........................updated batch_size ...................... 4...........................updated checkpoint_activations .......... True........................updated clip_grad ....................... 1.0.........................updated config_files .................... {'small_offload.yml': '# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n "pipe-parallel-size": 1,\n "model-parallel-size": 1,\n\n # model settings\n "num-layers": 12,\n "hidden-size": 768,\n "num-attention-heads": 12,\n "seq-length": 2048,\n "max-position-embeddings": 2048,\n "norm": "layernorm",\n "pos-emb": "rotary",\n "no-weight-tying": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n "scaled-upper-triang-masked-softmax-fusion": false,\n "bias-gelu-fusion": false,\n\n\n # optimizer settings\n "optimizer": {\n "type": "Adam",\n "params": {\n "lr": 0.0006,\n "betas": [0.9, 0.999],\n "eps": 1.0e-8,\n }\n },\n "zero_optimization": {\n "stage": 2,\n "allgather_partitions": True,\n "allgather_bucket_size": 500000000,\n "overlap_comm": True,\n "reduce_scatter": True,\n "reduce_bucket_size": 500000000,\n "contiguous_gradients": True,\n "cpu_offload": True\n },\n\n # batch / data settings\n "train_micro_batch_size_per_gpu": 4,\n "data-impl": "mmap",\n "split": "949,50,1",\n\n # activation checkpointing\n "checkpoint-activations": true,\n "checkpoint-num-layers": 1,\n "partition-activations": true,\n "synchronize-each-layer": true,\n\n # regularization\n "gradient_clipping": 1.0,\n "weight-decay": 0.0,\n "hidden-dropout": 0.0,\n "attention-dropout": 0.0,\n\n # precision settings\n "fp16": { \n "enabled": true,\n "loss_scale": 0,\n "loss_scale_window": 1000,\n "hysteresis": 2,\n "min_loss_scale": 1\n },\n\n # misc. training settings\n "train-iters": 320000,\n "lr-decay-iters": 320000,\n "distributed-backend": "nccl",\n "lr-decay-style": "cosine",\n "warmup": 0.01,\n "save-interval": 10000,\n "eval-interval": 1000,\n "eval-iters": 10,\n\n # logging\n "log-interval": 100,\n "steps_per_print": 10,\n "keep-last-n-checkpoints": 4,\n "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n "data-path": "data/enron/enron_text_document",\n \n # or for weighted datasets: \n # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "train-data-weights": [1., 2.],\n # "test-data-weights": [2., 1.],\n # "valid-data-weights": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
\n # WARNING: setting this to True will override any user provided weights\n # "weight_by_num_documents": false,\n # "weighted_sampler_alpha": 0.3,\n\n "vocab-file": "data/gpt2-vocab.json",\n "merge-file": "data/gpt2-merges.txt",\n\n "save": "checkpoints",\n "load": "checkpoints",\n "checkpoint_validation_with_forward_pass": False,\n \n "tensorboard-dir": "tensorboard",\n "log-dir": "logs",\n "use_wandb": True,\n "wandb_host": "https://api.wandb.ai",\n "wandb_project": "neox"\n}'}updated data_impl ....................... mmap........................updated data_path ....................... data/enron/enron_text_documentupdated dynamic_loss_scale .............. True........................updated eval_iters ...................... 10..........................updated fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated gas ............................. 1...........................updated global_num_gpus ................. 4...........................updated gradient_clipping ............... 1.0.........................updated hidden_dropout .................. 0.0.........................updated hidden_size ..................... 768.........................updated is_pipe_parallel ................ True........................updated keep_last_n_checkpoints ......... 4...........................updated load ............................ checkpoints.................updated log_dir ......................... logs........................updated log_interval .................... 100.........................updated lr .............................. 0.0006......................updated lr_decay_iters .................. 320000......................updated lr_decay_style .................. cosine......................updated max_position_embeddings ......... 2048........................updated merge_file ...................... data/gpt2-merges.txt........updated no_weight_tying ................. True........................updated num_attention_heads ............. 12..........................updated num_layers ...................... 12..........................updated optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated optimizer_type .................. Adam........................updated partition_activations ........... True........................updated pipe_parallel_size .............. 1...........................updated pos_emb ......................... rotary......................updated precision ....................... fp16........................updated save ............................ checkpoints.................updated save_interval ................... 10000.......................updated seq_length ...................... 2048........................updated sparsity_config ................. {}..........................updated split ........................... 949,50,1....................updated synchronize_each_layer .......... True........................updated tensorboard_dir ................. tensorboard.................updated train_batch_size ................ 16..........................updated train_iters ..................... 320000......................updated train_micro_batch_size_per_gpu .. 4...........................updated use_wandb ....................... True........................updated user_script ..................... train.py....................updated vocab_file ...................... 
data/gpt2-vocab.json........updated wall_clock_breakdown ............ True........................updated wandb_group ..................... QZUNHRmPyM5gy9C2ReHsUj_1fcswh3eupdated weight_decay .................... 0.0.........................updated zero_allgather_bucket_size ...... 500000000...................updated zero_contiguous_gradients ....... True........................updated zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': True}updated zero_reduce_bucket_size ......... 500000000...................updated zero_reduce_scatter ............. True........................updated zero_stage ...................... 2...........................updated activation ...................... gelu........................default adlr_autoresume ................. False.......................default adlr_autoresume_interval ........ 1000........................default amp ............................. None........................default apply_query_key_layer_scaling ... False.......................default attention_softmax_in_fp32 ....... False.......................default bias_dropout_fusion ............. False.......................default bias_gelu_fusion ................ False.......................default char_level_ppl .................. False.......................default checkpoint_in_cpu ............... False.......................default checkpoint_num_layers ........... 1...........................default checkpoint_validation_with_forward_pass False................default contiguous_checkpointing ........ False.......................default deepscale ....................... False.......................default deepscale_config ................ None........................default deepspeed ....................... True........................default deepspeed_activation_checkpointing True......................default deepspeed_mpi ................... False.......................default detect_nvlink_pairs ............. False.......................default distributed_backend ............. nccl........................default do_test ......................... None........................default do_train ........................ None........................default do_valid ........................ None........................default dump_state ...................... False.......................default eod_mask_loss ................... False.......................default eval_interval ................... 1000........................default eval_results_prefix ............. ............................default eval_tasks ...................... None........................default exclude ......................... None........................default exit_interval ................... None........................default finetune ........................ False.......................default flops_profiler .................. None........................default fp16_lm_cross_entropy ........... False.......................default fp32_allreduce .................. False.......................default git_hash ........................ 49e60fe.....................default gmlp_attn_dim ................... 64..........................default gpt_j_residual .................. False.......................default gradient_accumulation_steps ..... 
1...........................default gradient_noise_scale_cpu_offload False.......................default gradient_noise_scale_n_batches .. 5...........................default gradient_predivide_factor ....... 1.0.........................default hostfile ........................ None........................default hysteresis ...................... 2...........................default include ......................... None........................default init_method ..................... normal......................default init_method_std ................. 0.02........................default iteration ....................... None........................default launcher ........................ pdsh........................default layernorm_epsilon ............... 1e-05.......................default lazy_mpu_init ................... False.......................default local_rank ...................... None........................default log_grad_norm ................... False.......................default log_gradient_noise_scale ........ False.......................default log_optimizer_states ............ False.......................default log_param_norm .................. False.......................default loss_scale ...................... None........................default loss_scale_window ............... 1000.0......................default make_vocab_size_divisible_by .... 128.........................default master_addr ..................... None........................default master_port ..................... 29500.......................default maximum_tokens .................. 64..........................default min_lr .......................... 0.0.........................default min_scale ....................... 1.0.........................default mmap_warmup ..................... False.......................default model_parallel_size ............. 1...........................default no_load_optim ................... False.......................default no_load_rng ..................... False.......................default no_save_optim ................... False.......................default no_save_rng ..................... False.......................default norm ............................ layernorm...................default num_gpus ........................ None........................default num_nodes ....................... -1..........................default num_samples ..................... 0...........................default num_unique_layers ............... None........................default num_workers ..................... 2...........................default onnx_safe ....................... False.......................default output_layer_init_method ........ scaled_normal...............default output_layer_parallelism ........ row.........................default override_lr_scheduler ........... False.......................default padded_vocab_size ............... None........................default param_sharing_style ............. grouped.....................default pipe_partition_method ........... type:transformer|mlp........default prescale_gradients .............. False.......................default profile_backward ................ False.......................default rank ............................ None........................default recompute ....................... False.......................default rms_norm_epsilon ................ 1e-08.......................default rotary_emb_base ................. 
10000.......................default rotary_pct ...................... 1.0.........................default rpe_max_distance ................ 128.........................default rpe_num_buckets ................. 32..........................default sample_input_file ............... None........................default sample_output_file .............. None........................default scaled_masked_softmax_fusion .... False.......................default scaled_upper_triang_masked_softmax_fusion False..............default scalenorm_epsilon ............... 1e-08.......................default scheduler ....................... None........................default seed ............................ 1234........................default short_seq_prob .................. 0.1.........................default soft_prompt_tuning .............. None........................default sparse_gradients ................ False.......................default steps_per_print ................. 10..........................default temperature ..................... 0.0.........................default test_data_paths ................. None........................default test_data_weights ............... None........................default text_gen_type ................... None........................default tokenizer_type .................. GPT2BPETokenizer............default top_k ........................... 0...........................default top_p ........................... 0.0.........................default train_data_paths ................ None........................default train_data_weights .............. None........................default use_bnb_optimizer ............... False.......................default use_checkpoint_lr_scheduler ..... False.......................default use_cpu_initialization .......... False.......................default valid_data_paths ................ None........................default valid_data_weights .............. None........................default wandb_host ...................... https://api.wandb.ai........default wandb_project ................... neox........................default wandb_team ...................... None........................default warmup .......................... 0.01........................default weight_by_num_documents ......... False.......................default weighted_sampler_alpha .......... 0.3.........................default world_size ...................... None........................default zero_allow_untested_optimizer ... False.......................default ---------------- end of arguments ---------------- [2021-12-11 00:42:41,060] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. 
[2021-12-11 00:42:41,060] [INFO] [runner.py:366:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 1,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"no-weight-tying\": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \"scaled-upper-triang-masked-softmax-fusion\": false,\n \"bias-gelu-fusion\": false,\n\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.999],\n \"eps\": 1.0e-8,\n }\n },\n \"zero_optimization\": {\n \"stage\": 2,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": True,\n \"cpu_offload\": True\n },\n\n # batch / data settings\n \"train_micro_batch_size_per_gpu\": 4,\n \"data-impl\": \"mmap\",\n \"split\": \"949,50,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.0,\n 
\"hidden-dropout\": 0.0,\n \"attention-dropout\": 0.0,\n\n # precision settings\n \"fp16\": { \n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 320000,\n \"lr-decay-iters\": 320000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"save-interval\": 10000,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 100,\n \"steps_per_print\": 10,\n \"keep-last-n-checkpoints\": 4,\n \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n \"data-path\": \"data/enron/enron_text_document\",\n \n # or for weighted datasets: \n # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"train-data-weights\": [1., 2.],\n # \"test-data-weights\": [2., 1.],\n # \"valid-data-weights\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # \"weight_by_num_documents\": false,\n # \"weighted_sampler_alpha\": 0.3,\n\n \"vocab-file\": \"data/gpt2-vocab.json\",\n \"merge-file\": \"data/gpt2-merges.txt\",\n\n \"save\": \"checkpoints\",\n \"load\": \"checkpoints\",\n \"checkpoint_validation_with_forward_pass\": False,\n \n \"tensorboard-dir\": \"tensorboard\",\n \"log-dir\": \"logs\",\n \"use_wandb\": True,\n \"wandb_host\": \"https://api.wandb.ai\",\n \"wandb_project\": \"neox\"\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "QZUNHRmPyM5gy9C2ReHsUj_1fcswh3e", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4} [2021-12-11 00:42:42,055] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2021-12-11 00:42:42,055] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0 [2021-12-11 00:42:42,055] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(, {'localhost': [0, 1, 2, 3]}) [2021-12-11 00:42:42,055] [INFO] [launch.py:104:main] dist_world_size=4 [2021-12-11 00:42:42,055] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1 > building GPT2BPETokenizer tokenizer ... 
[2021-12-11 00:42:45,229] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-12-11 00:42:45,233] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-12-11 00:42:45,240] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written. > initializing torch distributed ... [2021-12-11 00:42:45,284] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > initializing model parallel with size 1 MPU DP: [0, 1, 2, 3] MPU PP: [0] MPU PP: [1] MPU PP: [2] MPU PP: [3] MPU MP: [0] MPU MP: [1] MPU MP: [2] MPU MP: [3] > setting random seeds to 1234 ... [2021-12-11 00:42:46,317] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 make: Entering directory '/gpt-neox/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/gpt-neox/megatron/data' building GPT2 model ... SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3} [2021-12-11 00:42:50,771] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp stage=0 layers=17 0: EmbeddingPipe 1: _pre_transformer_block 2: ParallelTransformerLayerPipe 3: ParallelTransformerLayerPipe 4: ParallelTransformerLayerPipe 5: ParallelTransformerLayerPipe 6: ParallelTransformerLayerPipe 7: ParallelTransformerLayerPipe 8: ParallelTransformerLayerPipe 9: ParallelTransformerLayerPipe 10: ParallelTransformerLayerPipe 11: ParallelTransformerLayerPipe 12: ParallelTransformerLayerPipe 13: ParallelTransformerLayerPipe 14: _post_transformer_block 15: NormPipe 16: ParallelLinearPipe loss: partial [2021-12-11 00:42:50,808] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08} > learning rate decay style: cosine DeepSpeed is enabled. [2021-12-11 00:42:50,810] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main [2021-12-11 00:42:50,811] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. [2021-12-11 00:42:50,813] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. [2021-12-11 00:42:50,814] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. 
[2021-12-11 00:42:52,185] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer [2021-12-11 00:42:52,186] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer [2021-12-11 00:42:52,186] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam Checking ZeRO support for optimizer=FusedAdam type= [2021-12-11 00:42:52,186] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer [2021-12-11 00:42:52,186] [INFO] [stage2.py:102:__init__] Reduce bucket size 500000000 [2021-12-11 00:42:52,186] [INFO] [stage2.py:103:__init__] Allgather bucket size 500000000 [2021-12-11 00:42:52,186] [INFO] [stage2.py:104:__init__] CPU Offload: True Using /root/.cache/torch_extensions as PyTorch extensions root... Using /root/.cache/torch_extensions as PyTorch extensions root... Using /root/.cache/torch_extensions as PyTorch extensions root... Using /root/.cache/torch_extensions as PyTorch extensions root... Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.35228919982910156 seconds Loading extension module utils...Loading extension module utils... Loading extension module utils... Time to load utils op: 0.402698278427124 secondsTime to load utils op: 0.40264415740966797 seconds Time to load utils op: 0.40321850776672363 seconds Traceback (most recent call last): File "train.py", line 27, in pretrain(neox_args=neox_args) File "/gpt-neox/megatron/training.py", line 82, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer( File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize engine = PipelineEngine(args=args, File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__ super().__init__(*super_args, **super_kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__ self._configure_optimizer(optimizer, model_parameters) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer self.optimizer = self._configure_zero_optimizer(basic_optimizer) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer optimizer = FP16_DeepSpeedZeroOptimizer( File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__ self.initialize_optimizer_states() File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states self.optimizer.step() File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step multi_tensor_applier(self.multi_tensor_adam, File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__ return op(self.chunk_size, RuntimeError: expected input to be on cuda Traceback (most recent call last): File "train.py", line 27, in pretrain(neox_args=neox_args) File 
"/gpt-neox/megatron/training.py", line 82, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer( File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize engine = PipelineEngine(args=args, File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__ super().__init__(*super_args, **super_kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__ self._configure_optimizer(optimizer, model_parameters) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer self.optimizer = self._configure_zero_optimizer(basic_optimizer) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer optimizer = FP16_DeepSpeedZeroOptimizer( File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__ self.initialize_optimizer_states() File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states self.optimizer.step() File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step multi_tensor_applier(self.multi_tensor_adam, File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__ return op(self.chunk_size, RuntimeError: expected input to be on cuda Traceback (most recent call last): File "train.py", line 27, in pretrain(neox_args=neox_args) File "/gpt-neox/megatron/training.py", line 82, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer( File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize engine = PipelineEngine(args=args, File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__ super().__init__(*super_args, **super_kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__ self._configure_optimizer(optimizer, model_parameters) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer self.optimizer = self._configure_zero_optimizer(basic_optimizer) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer optimizer = FP16_DeepSpeedZeroOptimizer( File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__ self.initialize_optimizer_states() File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states self.optimizer.step() File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step multi_tensor_applier(self.multi_tensor_adam, File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__ return op(self.chunk_size, RuntimeError: expected input to be 
on cuda Traceback (most recent call last): File "train.py", line 27, in pretrain(neox_args=neox_args) File "/gpt-neox/megatron/training.py", line 82, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer( File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 128, in initialize engine = PipelineEngine(args=args, File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 60, in __init__ super().__init__(*super_args, **super_kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__ self._configure_optimizer(optimizer, model_parameters) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer self.optimizer = self._configure_zero_optimizer(basic_optimizer) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer optimizer = FP16_DeepSpeedZeroOptimizer( File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__ self.initialize_optimizer_states() File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states self.optimizer.step() File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step multi_tensor_applier(self.multi_tensor_adam, File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__ return op(self.chunk_size, RuntimeError: expected input to be on cuda Killing subprocess 4252 Killing subprocess 4253 Killing subprocess 4254 Killing subprocess 4255 Traceback (most recent call last): File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in main() File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, 
"gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\\n{\\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n # across the node boundaries )\\n \\"pipe-parallel-size\\": 1,\\n \\"model-parallel-size\\": 1,\\n\\n # model settings\\n \\"num-layers\\": 12,\\n \\"hidden-size\\": 768,\\n \\"num-attention-heads\\": 12,\\n \\"seq-length\\": 2048,\\n \\"max-position-embeddings\\": 2048,\\n \\"norm\\": \\"layernorm\\",\\n \\"pos-emb\\": \\"rotary\\",\\n \\"no-weight-tying\\": true,\\n\\n # these should provide some speedup but takes a while to build, set to true if desired\\n \\"scaled-upper-triang-masked-softmax-fusion\\": false,\\n \\"bias-gelu-fusion\\": false,\\n\\n\\n # optimizer settings\\n \\"optimizer\\": {\\n \\"type\\": \\"Adam\\",\\n \\"params\\": {\\n \\"lr\\": 0.0006,\\n \\"betas\\": [0.9, 0.999],\\n \\"eps\\": 1.0e-8,\\n }\\n },\\n \\"zero_optimization\\": {\\n \\"stage\\": 2,\\n \\"allgather_partitions\\": True,\\n \\"allgather_bucket_size\\": 500000000,\\n \\"overlap_comm\\": True,\\n \\"reduce_scatter\\": True,\\n \\"reduce_bucket_size\\": 500000000,\\n \\"contiguous_gradients\\": True,\\n \\"cpu_offload\\": True\\n },\\n\\n # batch / data settings\\n \\"train_micro_batch_size_per_gpu\\": 4,\\n \\"data-impl\\": \\"mmap\\",\\n \\"split\\": \\"949,50,1\\",\\n\\n # activation checkpointing\\n \\"checkpoint-activations\\": true,\\n \\"checkpoint-num-layers\\": 1,\\n \\"partition-activations\\": true,\\n \\"synchronize-each-layer\\": true,\\n\\n # regularization\\n \\"gradient_clipping\\": 1.0,\\n \\"weight-decay\\": 0.0,\\n \\"hidden-dropout\\": 0.0,\\n \\"attention-dropout\\": 0.0,\\n\\n # precision settings\\n \\"fp16\\": { \\n \\"enabled\\": true,\\n \\"loss_scale\\": 0,\\n \\"loss_scale_window\\": 1000,\\n \\"hysteresis\\": 2,\\n \\"min_loss_scale\\": 1\\n },\\n\\n # misc. 
training settings\\n \\"train-iters\\": 320000,\\n \\"lr-decay-iters\\": 320000,\\n \\"distributed-backend\\": \\"nccl\\",\\n \\"lr-decay-style\\": \\"cosine\\",\\n \\"warmup\\": 0.01,\\n \\"save-interval\\": 10000,\\n \\"eval-interval\\": 1000,\\n \\"eval-iters\\": 10,\\n\\n # logging\\n \\"log-interval\\": 100,\\n \\"steps_per_print\\": 10,\\n \\"keep-last-n-checkpoints\\": 4,\\n \\"wall_clock_breakdown\\": true,\\n}\\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n \\"data-path\\": \\"data/enron/enron_text_document\\",\\n \\n # or for weighted datasets: \\n # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"train-data-weights\\": [1., 2.],\\n # \\"test-data-weights\\": [2., 1.],\\n # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \\n # WARNING: setting this to True will override any user provided weights\\n # \\"weight_by_num_documents\\": false,\\n # \\"weighted_sampler_alpha\\": 0.3,\\n\\n \\"vocab-file\\": \\"data/gpt2-vocab.json\\",\\n \\"merge-file\\": \\"data/gpt2-merges.txt\\",\\n\\n \\"save\\": \\"checkpoints\\",\\n \\"load\\": \\"checkpoints\\",\\n \\"checkpoint_validation_with_forward_pass\\": False,\\n \\n \\"tensorboard-dir\\": \\"tensorboard\\",\\n \\"log-dir\\": \\"logs\\",\\n \\"use_wandb\\": True,\\n \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n \\"wandb_project\\": \\"neox\\"\\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": true, "wandb_group": "QZUNHRmPyM5gy9C2ReHsUj_1fcswh3e", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.
I don't have time right now, but in the next couple of weeks I might try building a simple feed-forward network and wrapping it in DeeperSpeed ZeRO 2 with CPU offload, to isolate the issue to either GPT-NeoX or DeeperSpeed; something along the lines of the sketch below.
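Roughly what I have in mind (an untested sketch against plain DeepSpeed/DeeperSpeed 0.3.x, not GPT-NeoX; the model, layer sizes, filename, and config values are placeholders, and the optimizer is built by DeepSpeed from the config rather than passed in as a client optimizer):

import argparse
import torch
import torch.nn as nn
import deepspeed

# ZeRO stage 2 with optimizer CPU offload and fp16, mirroring the failing GPT-NeoX config.
ds_config = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 0.0006}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2, "cpu_offload": True},
}

# The deepspeed launcher passes --local_rank to the script.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Simple feed-forward network, nothing GPT-NeoX specific.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Let DeepSpeed construct the optimizer from ds_config. config_params accepts the config
# as a dict here; some releases may instead need a JSON file passed via --deepspeed_config.
engine, _, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

device = torch.device("cuda", engine.local_rank)
for step in range(10):
    # fp16 is enabled, so feed half-precision inputs.
    x = torch.randn(4, 1024, device=device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    if torch.distributed.get_rank() == 0:
        print(f"step {step}: loss {loss.item():.4f}")

Launched with the DeepSpeed launcher, e.g. deepspeed --num_gpus 4 test_zero2_offload.py. If this runs cleanly, the problem is likely in how GPT-NeoX builds and passes its optimizer; if it dies with the same "expected input to be on cuda" error, it points at the DeeperSpeed side.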
Has cpu_offload been confirmed to work with GPT-NeoX in the past, and do you know which configs are known to work? If there is a config that worked before, it would be a good starting point for debugging.
@pwstegman If you set pipe-parallel-size=0, what happens? With PP = 1, it's supposed to initialize as a PP module and then convert to a non-pipeline model, but I vaguely recall there being a bug and that conversion not happening.
Pipeline parallelism and ZeRO 2 are incompatible, so if the model is still a one-stage pipeline module, that could be the source of the problem.
The printout says it’s a pipeline module, but off the top of my head I’m not 100% sure that we put the print statement in the right place relative to the casting to non-pipeline :P
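For clarity, the change I'm suggesting is the same one-line edit style you used above, e.g.:

sed -i 's/"pipe-parallel-size": 1/"pipe-parallel-size": 0/g' configs/small_offload.yml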
Also, just to confirm: the configs you're describing are a minimal example or a test to make sure things work, right? This is a very poor way to actually train a 125M model on your hardware; you should not need CPU offload unless you're trying to train a model with at least 40B parameters.
Pipeline parallelism and ZeRO 2/3 are not compatible, then? How would one train a large model, say 20B parameters, without pipeline parallelism while using CPU offload, even for just a single step? With ZeRO 2/3 I have run into OOM issues on a 3090 even with smaller models (though still larger than I could fit without the higher ZeRO stages); that was not on this repo per se, but elsewhere, for example when finetuning GPT-J.
I have finetuned GPT-J on a 3090 using CPU offloading, but trying to train a model with the 6B config here without offloading leads to OOM issues.
I have encountered the same issue this thread highlights, "RuntimeError: expected input to be on cuda", when trying to enable CPU offloading.
With pipeline parallelism left at 1, my RAM usage increases as if offloading were working properly, right up until that error. After changing it to 0, RAM does not increase as expected and I promptly get an OOM error.
@StellaAthena Just tried setting pipeline parallelism to 0 (model parallel was 1) and got the same error.
NeoXArgs.from_ymls() ['configs/small_offload.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
attention_dropout ............... 0.0.........................updated
batch_size ...................... 4...........................updated
checkpoint_activations .......... True........................updated
clip_grad ....................... 1.0.........................updated
config_files .................... {'small_offload.yml': '# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n "pipe-parallel-size": 0,\n "model-parallel-size": 1,\n\n # model settings\n "num-layers": 12,\n "hidden-size": 768,\n "num-attention-heads": 12,\n "seq-length": 2048,\n "max-position-embeddings": 2048,\n "norm": "layernorm",\n "pos-emb": "rotary",\n "no-weight-tying": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n "scaled-upper-triang-masked-softmax-fusion": false,\n "bias-gelu-fusion": false,\n\n\n # optimizer settings\n "optimizer": {\n "type": "Adam",\n "params": {\n "lr": 0.0006,\n "betas": [0.9, 0.999],\n "eps": 1.0e-8,\n }\n },\n "zero_optimization": {\n "stage": 2,\n "allgather_partitions": True,\n "allgather_bucket_size": 500000000,\n "overlap_comm": True,\n "reduce_scatter": True,\n "reduce_bucket_size": 500000000,\n "contiguous_gradients": True,\n "cpu_offload": True\n },\n\n # batch / data settings\n "train_micro_batch_size_per_gpu": 4,\n "data-impl": "mmap",\n "split": "949,50,1",\n\n # activation checkpointing\n "checkpoint-activations": true,\n "checkpoint-num-layers": 1,\n "partition-activations": true,\n "synchronize-each-layer": true,\n\n # regularization\n "gradient_clipping": 1.0,\n "weight-decay": 0.0,\n "hidden-dropout": 0.0,\n "attention-dropout": 0.0,\n\n # precision settings\n "fp16": { \n "enabled": true,\n "loss_scale": 0,\n "loss_scale_window": 1000,\n "hysteresis": 2,\n "min_loss_scale": 1\n },\n\n # misc. training settings\n "train-iters": 320000,\n "lr-decay-iters": 320000,\n "distributed-backend": "nccl",\n "lr-decay-style": "cosine",\n "warmup": 0.01,\n "save-interval": 10000,\n "eval-interval": 1000,\n "eval-iters": 10,\n\n # logging\n "log-interval": 100,\n "steps_per_print": 10,\n "keep-last-n-checkpoints": 4,\n "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n "data-path": "data/enron/enron_text_document",\n \n # or for weighted datasets: \n # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "train-data-weights": [1., 2.],\n # "test-data-weights": [2., 1.],\n # "valid-data-weights": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # "weight_by_num_documents": false,\n # "weighted_sampler_alpha": 0.3,\n\n "vocab-file": "data/gpt2-vocab.json",\n "merge-file": "data/gpt2-merges.txt",\n\n "save": "checkpoints",\n "load": "checkpoints",\n "checkpoint_validation_with_forward_pass": False,\n \n "tensorboard-dir": "tensorboard",\n "log-dir": "logs",\n "use_wandb": True,\n "wandb_host": "https://api.wandb.ai",\n "wandb_project": "neox"\n}'}updated
data_impl ....................... mmap........................updated
data_path ....................... data/enron/enron_text_documentupdated
dynamic_loss_scale .............. True........................updated
eval_iters ...................... 10..........................updated
fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
gas ............................. 1...........................updated
global_num_gpus ................. 4...........................updated
gradient_clipping ............... 1.0.........................updated
hidden_dropout .................. 0.0.........................updated
hidden_size ..................... 768.........................updated
keep_last_n_checkpoints ......... 4...........................updated
load ............................ checkpoints.................updated
log_dir ......................... logs........................updated
log_interval .................... 100.........................updated
lr .............................. 0.0006......................updated
lr_decay_iters .................. 320000......................updated
lr_decay_style .................. cosine......................updated
max_position_embeddings ......... 2048........................updated
merge_file ...................... data/gpt2-merges.txt........updated
no_weight_tying ................. True........................updated
num_attention_heads ............. 12..........................updated
num_layers ...................... 12..........................updated
optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
optimizer_type .................. Adam........................updated
partition_activations ........... True........................updated
pos_emb ......................... rotary......................updated
precision ....................... fp16........................updated
save ............................ checkpoints.................updated
save_interval ................... 10000.......................updated
seq_length ...................... 2048........................updated
sparsity_config ................. {}..........................updated
split ........................... 949,50,1....................updated
synchronize_each_layer .......... True........................updated
tensorboard_dir ................. tensorboard.................updated
train_batch_size ................ 16..........................updated
train_iters ..................... 320000......................updated
train_micro_batch_size_per_gpu .. 4...........................updated
use_wandb ....................... True........................updated
user_script ..................... train.py....................updated
vocab_file ...................... data/gpt2-vocab.json........updated
wall_clock_breakdown ............ True........................updated
wandb_group ..................... 2JTahMMU4myukKqf7nUqJp_35tzwh2kupdated
weight_decay .................... 0.0.........................updated
zero_allgather_bucket_size ...... 500000000...................updated
zero_contiguous_gradients ....... True........................updated
zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': True}updated
zero_reduce_bucket_size ......... 500000000...................updated
zero_reduce_scatter ............. True........................updated
zero_stage ...................... 2...........................updated
activation ...................... gelu........................default
adlr_autoresume ................. False.......................default
adlr_autoresume_interval ........ 1000........................default
amp ............................. None........................default
apply_query_key_layer_scaling ... False.......................default
attention_softmax_in_fp32 ....... False.......................default
bias_dropout_fusion ............. False.......................default
bias_gelu_fusion ................ False.......................default
char_level_ppl .................. False.......................default
checkpoint_in_cpu ............... False.......................default
checkpoint_num_layers ........... 1...........................default
checkpoint_validation_with_forward_pass False................default
contiguous_checkpointing ........ False.......................default
deepscale ....................... False.......................default
deepscale_config ................ None........................default
deepspeed ....................... True........................default
deepspeed_activation_checkpointing True......................default
deepspeed_mpi ................... False.......................default
detect_nvlink_pairs ............. False.......................default
distributed_backend ............. nccl........................default
do_test ......................... None........................default
do_train ........................ None........................default
do_valid ........................ None........................default
dump_state ...................... False.......................default
eod_mask_loss ................... False.......................default
eval_interval ................... 1000........................default
eval_results_prefix ............. ............................default
eval_tasks ...................... None........................default
exclude ......................... None........................default
exit_interval ................... None........................default
finetune ........................ False.......................default
flops_profiler .................. None........................default
fp16_lm_cross_entropy ........... False.......................default
fp32_allreduce .................. False.......................default
git_hash ........................ 49e60fe.....................default
gmlp_attn_dim ................... 64..........................default
gpt_j_residual .................. False.......................default
gradient_accumulation_steps ..... 1...........................default
gradient_noise_scale_cpu_offload False.......................default
gradient_noise_scale_n_batches .. 5...........................default
gradient_predivide_factor ....... 1.0.........................default
hostfile ........................ None........................default
hysteresis ...................... 2...........................default
include ......................... None........................default
init_method ..................... normal......................default
init_method_std ................. 0.02........................default
is_pipe_parallel ................ False.......................default
iteration ....................... None........................default
launcher ........................ pdsh........................default
layernorm_epsilon ............... 1e-05.......................default
lazy_mpu_init ................... False.......................default
local_rank ...................... None........................default
log_grad_norm ................... False.......................default
log_gradient_noise_scale ........ False.......................default
log_optimizer_states ............ False.......................default
log_param_norm .................. False.......................default
loss_scale ...................... None........................default
loss_scale_window ............... 1000.0......................default
make_vocab_size_divisible_by .... 128.........................default
master_addr ..................... None........................default
master_port ..................... 29500.......................default
maximum_tokens .................. 64..........................default
min_lr .......................... 0.0.........................default
min_scale ....................... 1.0.........................default
mmap_warmup ..................... False.......................default
model_parallel_size ............. 1...........................default
no_load_optim ................... False.......................default
no_load_rng ..................... False.......................default
no_save_optim ................... False.......................default
no_save_rng ..................... False.......................default
norm ............................ layernorm...................default
num_gpus ........................ None........................default
num_nodes ....................... -1..........................default
num_samples ..................... 0...........................default
num_unique_layers ............... None........................default
num_workers ..................... 2...........................default
onnx_safe ....................... False.......................default
output_layer_init_method ........ scaled_normal...............default
output_layer_parallelism ........ row.........................default
override_lr_scheduler ........... False.......................default
padded_vocab_size ............... None........................default
param_sharing_style ............. grouped.....................default
pipe_parallel_size .............. 0...........................default
pipe_partition_method ........... type:transformer|mlp........default
prescale_gradients .............. False.......................default
profile_backward ................ False.......................default
rank ............................ None........................default
recompute ....................... False.......................default
rms_norm_epsilon ................ 1e-08.......................default
rotary_emb_base ................. 10000.......................default
rotary_pct ...................... 1.0.........................default
rpe_max_distance ................ 128.........................default
rpe_num_buckets ................. 32..........................default
sample_input_file ............... None........................default
sample_output_file .............. None........................default
scaled_masked_softmax_fusion .... False.......................default
scaled_upper_triang_masked_softmax_fusion False..............default
scalenorm_epsilon ............... 1e-08.......................default
scheduler ....................... None........................default
seed ............................ 1234........................default
short_seq_prob .................. 0.1.........................default
soft_prompt_tuning .............. None........................default
sparse_gradients ................ False.......................default
steps_per_print ................. 10..........................default
temperature ..................... 0.0.........................default
test_data_paths ................. None........................default
test_data_weights ............... None........................default
text_gen_type ................... None........................default
tokenizer_type .................. GPT2BPETokenizer............default
top_k ........................... 0...........................default
top_p ........................... 0.0.........................default
train_data_paths ................ None........................default
train_data_weights .............. None........................default
use_bnb_optimizer ............... False.......................default
use_checkpoint_lr_scheduler ..... False.......................default
use_cpu_initialization .......... False.......................default
valid_data_paths ................ None........................default
valid_data_weights .............. None........................default
wandb_host ...................... https://api.wandb.ai........default
wandb_project ................... neox........................default
wandb_team ...................... None........................default
warmup .......................... 0.01........................default
weight_by_num_documents ......... False.......................default
weighted_sampler_alpha .......... 0.3.........................default
world_size ...................... None........................default
zero_allow_untested_optimizer ... False.......................default
---------------- end of arguments ----------------
[2021-12-15 02:09:34,263] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-12-15 02:09:34,263] [INFO] [runner.py:366:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 0,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"no-weight-tying\": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \"scaled-upper-triang-masked-softmax-fusion\": false,\n \"bias-gelu-fusion\": false,\n\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.999],\n \"eps\": 1.0e-8,\n }\n },\n \"zero_optimization\": {\n \"stage\": 2,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": True,\n \"cpu_offload\": True\n },\n\n # batch / data settings\n \"train_micro_batch_size_per_gpu\": 4,\n \"data-impl\": \"mmap\",\n \"split\": \"949,50,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.0,\n 
\"hidden-dropout\": 0.0,\n \"attention-dropout\": 0.0,\n\n # precision settings\n \"fp16\": { \n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 320000,\n \"lr-decay-iters\": 320000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"save-interval\": 10000,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 100,\n \"steps_per_print\": 10,\n \"keep-last-n-checkpoints\": 4,\n \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n \"data-path\": \"data/enron/enron_text_document\",\n \n # or for weighted datasets: \n # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"train-data-weights\": [1., 2.],\n # \"test-data-weights\": [2., 1.],\n # \"valid-data-weights\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # \"weight_by_num_documents\": false,\n # \"weighted_sampler_alpha\": 0.3,\n\n \"vocab-file\": \"data/gpt2-vocab.json\",\n \"merge-file\": \"data/gpt2-merges.txt\",\n\n \"save\": \"checkpoints\",\n \"load\": \"checkpoints\",\n \"checkpoint_validation_with_forward_pass\": False,\n \n \"tensorboard-dir\": \"tensorboard\",\n \"log-dir\": \"logs\",\n \"use_wandb\": True,\n \"wandb_host\": \"https://api.wandb.ai\",\n \"wandb_project\": \"neox\"\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "use_wandb": true, "wandb_group": "2JTahMMU4myukKqf7nUqJp_35tzwh2k", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}
[2021-12-15 02:09:35,128] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-12-15 02:09:35,128] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-12-15 02:09:35,128] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-12-15 02:09:35,128] [INFO] [launch.py:104:main] dist_world_size=4
[2021-12-15 02:09:35,128] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
[2021-12-15 02:09:38,268] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-15 02:09:38,271] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2021-12-15 02:09:38,306] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-15 02:09:38,312] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 1
MPU DP: [0, 1, 2, 3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2021-12-15 02:09:39,355] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2021-12-15 02:09:43,829] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
0: EmbeddingPipe
1: _pre_transformer_block
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: _post_transformer_block
15: NormPipe
16: ParallelLinearPipe
loss: partial
[2021-12-15 02:09:43,880] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-15 02:09:43,881] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
> learning rate decay style: cosine
DeepSpeed is enabled.
[2021-12-15 02:09:43,883] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-15 02:09:43,883] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-15 02:09:43,883] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-15 02:09:45,231] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-15 02:09:45,231] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-15 02:09:45,231] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2021-12-15 02:09:45,231] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-15 02:09:45,231] [INFO] [stage2.py:102:__init__] Reduce bucket size 500000000
[2021-12-15 02:09:45,231] [INFO] [stage2.py:103:__init__] Allgather bucket size 500000000
[2021-12-15 02:09:45,232] [INFO] [stage2.py:104:__init__] CPU Offload: True
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3600890636444092 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.40231990814208984 seconds
Time to load utils op: 0.4020683765411377 seconds
Time to load utils op: 0.40213465690612793 seconds
Traceback (most recent call last):
File "train.py", line 27, in
pretrain(neox_args=neox_args)
File "/gpt-neox/megatron/training.py", line 82, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(
File "/gpt-neox/megatron/training.py", line 399, in setup_model_and_optimizer
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 116, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 192, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 681, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 824, in _configure_zero_optimizer
optimizer = FP16_DeepSpeedZeroOptimizer(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 373, in __init__
self.initialize_optimizer_states()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 399, in initialize_optimizer_states
self.optimizer.step()
File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
multi_tensor_applier(self.multi_tensor_adam,
File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
return op(self.chunk_size,
RuntimeError: expected input to be on cuda
(The same traceback is printed by each of the other three ranks.)
Killing subprocess 3914
Killing subprocess 3915
Killing subprocess 3916
Killing subprocess 3917
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in
main()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 2, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 2, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small_offload.yml": "# GPT-2 pretraining setup\\n{\\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\\n # across the node boundaries )\\n \\"pipe-parallel-size\\": 0,\\n \\"model-parallel-size\\": 1,\\n\\n # model settings\\n \\"num-layers\\": 12,\\n \\"hidden-size\\": 768,\\n \\"num-attention-heads\\": 12,\\n \\"seq-length\\": 2048,\\n \\"max-position-embeddings\\": 2048,\\n \\"norm\\": \\"layernorm\\",\\n \\"pos-emb\\": \\"rotary\\",\\n \\"no-weight-tying\\": true,\\n\\n # these should provide some speedup but takes a while to build, set to true if desired\\n \\"scaled-upper-triang-masked-softmax-fusion\\": false,\\n \\"bias-gelu-fusion\\": false,\\n\\n\\n # optimizer settings\\n \\"optimizer\\": {\\n \\"type\\": \\"Adam\\",\\n \\"params\\": {\\n \\"lr\\": 0.0006,\\n \\"betas\\": [0.9, 0.999],\\n \\"eps\\": 1.0e-8,\\n }\\n },\\n \\"zero_optimization\\": {\\n \\"stage\\": 2,\\n \\"allgather_partitions\\": True,\\n \\"allgather_bucket_size\\": 500000000,\\n \\"overlap_comm\\": True,\\n \\"reduce_scatter\\": True,\\n \\"reduce_bucket_size\\": 500000000,\\n \\"contiguous_gradients\\": True,\\n \\"cpu_offload\\": True\\n },\\n\\n # batch / data settings\\n \\"train_micro_batch_size_per_gpu\\": 4,\\n \\"data-impl\\": \\"mmap\\",\\n \\"split\\": \\"949,50,1\\",\\n\\n # activation checkpointing\\n \\"checkpoint-activations\\": true,\\n \\"checkpoint-num-layers\\": 1,\\n \\"partition-activations\\": true,\\n \\"synchronize-each-layer\\": true,\\n\\n # regularization\\n \\"gradient_clipping\\": 1.0,\\n 
\\"weight-decay\\": 0.0,\\n \\"hidden-dropout\\": 0.0,\\n \\"attention-dropout\\": 0.0,\\n\\n # precision settings\\n \\"fp16\\": { \\n \\"enabled\\": true,\\n \\"loss_scale\\": 0,\\n \\"loss_scale_window\\": 1000,\\n \\"hysteresis\\": 2,\\n \\"min_loss_scale\\": 1\\n },\\n\\n # misc. training settings\\n \\"train-iters\\": 320000,\\n \\"lr-decay-iters\\": 320000,\\n \\"distributed-backend\\": \\"nccl\\",\\n \\"lr-decay-style\\": \\"cosine\\",\\n \\"warmup\\": 0.01,\\n \\"save-interval\\": 10000,\\n \\"eval-interval\\": 1000,\\n \\"eval-iters\\": 10,\\n\\n # logging\\n \\"log-interval\\": 100,\\n \\"steps_per_print\\": 10,\\n \\"keep-last-n-checkpoints\\": 4,\\n \\"wall_clock_breakdown\\": true,\\n}\\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\\n{\\n \\"data-path\\": \\"data/enron/enron_text_document\\",\\n \\n # or for weighted datasets: \\n # \\"train-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"test-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"valid-data-paths\\": [\\"data/enron/enron_text_document\\", \\"data/enron/enron_text_document\\"],\\n # \\"train-data-weights\\": [1., 2.],\\n # \\"test-data-weights\\": [2., 1.],\\n # \\"valid-data-weights\\": [0.5, 0.4],\\n\\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \\n # WARNING: setting this to True will override any user provided weights\\n # \\"weight_by_num_documents\\": false,\\n # \\"weighted_sampler_alpha\\": 0.3,\\n\\n \\"vocab-file\\": \\"data/gpt2-vocab.json\\",\\n \\"merge-file\\": \\"data/gpt2-merges.txt\\",\\n\\n \\"save\\": \\"checkpoints\\",\\n \\"load\\": \\"checkpoints\\",\\n \\"checkpoint_validation_with_forward_pass\\": False,\\n \\n \\"tensorboard-dir\\": \\"tensorboard\\",\\n \\"log-dir\\": \\"logs\\",\\n \\"use_wandb\\": True,\\n \\"wandb_host\\": \\"https://api.wandb.ai\\",\\n \\"wandb_project\\": \\"neox\\"\\n}"}, "load": "checkpoints", "save_interval": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.0, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "use_wandb": true, "wandb_group": "2JTahMMU4myukKqf7nUqJp_35tzwh2k", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.
Appreciate the heads up. Yeah, I'm just using the small config to debug CPU offload. I don't intend to use it to train a model that small.
CPU offload has worked in the past with model parallelism on the latest DeepSpeed, using https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-ZeRO3. I don't have the exact steps to reproduce at hand right now.
@mallorbc You can use model parallelism, which does work with CPU offload (at least, it should once we sort this issue out). It also splits the model across multiple GPUs and is even more memory-efficient than pipeline parallelism, at the cost of compute time. (source: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/, "Understand the tradeoffs of data, model, and pipeline parallelism" section)
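For reference, here is a minimal sketch of the settings that combining model parallelism with ZeRO 2 CPU offload would involve, assuming a single 4-GPU node and 2-way model parallelism. I haven't verified this exact combination end to end, so treat it as a starting point rather than a known-good config:
  # hypothetical: split each layer across 2 GPUs; ZeRO stage 2 is used without pipeline stages
  "model-parallel-size": 2,
  "pipe-parallel-size": 0,

  "zero_optimization": {
    "stage": 2,
    "overlap_comm": True,
    "contiguous_gradients": True,
    "cpu_offload": True
  },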
Describe the bug
When starting with small.yml, then changing ZeRO to 2 and cpu_offload to true, I get the following error. Similar to https://github.com/EleutherAI/gpt-neox/issues/171.
Full Traceback
To Reproduce
Starting on a host with NVIDIA driver 460.91.03, I ran a new docker container using image nvidia/cuda:11.1.1-devel-ubuntu18.04. I created a new Python 3.8.0 virtual env, installed torch 1.8.2+cu111, installed the latest NVIDIA apex, then installed all the requirements in each file within the requirements directory of GPT-NeoX. I changed two lines in the small.yml config in the "zero_optimization" section:
"stage": 2
"cpu_offload": True
Then I ran:
This caused the mentioned error.
Expected behavior
The optimizer should be offloaded to the CPU.
Proposed solution
No proposed solutions yet. Are there any known working cpu_offload configs? I can start there to try and debug this.
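Not a fix, but given the repeated "cpu_offload is deprecated. Please use offload_optimizer" warnings in the log, one variant I can test is the newer offload_optimizer form of the ZeRO block. A rough sketch (untested on my side; the exact keys accepted may depend on the DeepSpeed/DeeperSpeed version):
  "zero_optimization": {
    "stage": 2,
    # offload only the optimizer states to host memory
    "offload_optimizer": {
      "device": "cpu"
    }
  },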
Screenshots
None
Environment (please complete the following information):
small.yml (modified)
local_setup_small.yml
Additional context
None