argonne-lcf / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Add `LLAMA_MODE` toggle #28

Closed saforem2 closed 4 months ago

saforem2 commented 4 months ago

Adds the ability to toggle `"${LLAMA_ARGS}"` on / off by setting:

```bash
NO_LLAMA=1 bash train_llama_alcf.sh
```

which will disable the `"${LLAMA_ARGS}"` that are built into the `run_cmd` from here
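
for reference, roughly what the toggle is meant to do inside the script (a minimal sketch; the variable names and the flags shown are illustrative, the real wiring lives in `train_llama_alcf.sh`):

```bash
# Sketch only: assumes LLAMA_ARGS holds the llama-specific flags that normally
# get baked into run_cmd, and that setting NO_LLAMA in the environment drops them.
LLAMA_ARGS="--swiglu --use-rotary-position-embeddings"   # illustrative flags, not the exact set
if [[ -n "${NO_LLAMA:-}" ]]; then
  echo "NO_LLAMA=${NO_LLAMA}: disabling LLAMA_ARGS"
  LLAMA_ARGS=""
fi
run_cmd="python3 pretrain_gpt_alcf.py ${LLAMA_ARGS}"
echo "${run_cmd}"
```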

saforem2 commented 4 months ago

e.g., to reproduce the GPT 1.5B model:

```bash
PBS_O_WORKDIR=$(pwd) NO_LLAMA=1 TOKENIZER_TYPE="gpt" NLAYERS=48 HIDDEN=1536 SEQ=2048 HEADS=24 bash train_llama_alcf.sh
```
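
(as a rough sanity check on the "1.5B" label, a back-of-the-envelope parameter count with these settings, assuming a ~50k padded GPT-2 vocabulary; this is just arithmetic, not something the script prints:)

```bash
# Hypothetical back-of-the-envelope check (not part of the script):
# transformer params ≈ 12 * NLAYERS * HIDDEN^2, plus vocab embeddings V * HIDDEN
python3 -c 'L, H, V = 48, 1536, 50304; print(f"≈ {(12*L*H*H + V*H)/1e9:.2f}B params")'
# prints ≈ 1.44B, i.e. in the GPT "1.5B" ballpark
```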

though, for some reason, this seems to crash only on the first attempt (?? 🤔) with `[rank1]: ValueError: mmap length is greater than file size` (a guess at the likely cause is sketched below, after the logs):

1. [Take 1]: running the first time crashes with `[rank1]: ValueError: mmap length is greater than file size`;
traceback:

```bash
# [06:21:21 PM][foremans@x3004c0s37b1n0][…/Megatron-DeepSpeed][🌱 llam][!?] (2024-04-29)
$ PBS_O_WORKDIR=$(pwd) TOKENIZER_TYPE="gpt" NLAYERS=48 HIDDEN=1536 SEQ=2048 HEADS=24 GRAD_ACC_STEPS=4 NO_LLAMA=1 bash train_llama_alcf.sh
# ...clipped...
[2024-06-16 18:21:57][INFO][training:1634] - > building train, validation, and test datasets ...
[2024-06-16 18:21:57][INFO][training:1617] - > datasets target sizes (minimum size):
[2024-06-16 18:21:57][INFO][training:1618] - train: 10172544
[2024-06-16 18:21:57][INFO][training:1619] - validation: 2240
[2024-06-16 18:21:57][INFO][training:1620] - test: 320
[2024-06-16 18:21:57][INFO][pretrain_gpt_alcf:497] - > building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
> building dataset index ...
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap... /eagle/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document.bin
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.002640 seconds
number of documents: 17868
> dataset split:
train: document indices in [0, 17868) total of 17868 documents
validation: document indices in [17868, 17868) total of 0 documents
test: document indices in [17868, 17868) total of 0 documents
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
[2024-06-16 18:21:58][INFO][utils:246] - [0] > WARNING: could not find index map files, building
> last epoch number of samples (719954) is larger than 80% of number of samples per epoch (787715), setting separate_last_epoch to False
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
[2024-06-16 18:21:58][INFO][utils:246] - [0] > elasped time to build and save doc-idx mapping (seconds): 0.043984
using: number of documents: 17868 number of epochs: 13 sequence length: 2048 total number of samples: 10240306
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
[2024-06-16 18:21:59][INFO][utils:246] - [0] > elasped time to build and save sample-idx mapping (seconds): 0.946719
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
[rank2]: Traceback (most recent call last):
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 635, in
[rank2]: model = main()
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 594, in main
[rank2]: model = pretrain(
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 233, in pretrain
[rank2]: = build_train_valid_test_data_iterators(
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 1695, in build_train_valid_test_data_iterators
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 1651, in build_train_valid_test_data_loaders
[rank2]: train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 1623, in build_train_valid_test_datasets
[rank2]: return build_train_valid_test_datasets_provider(train_val_test_num_samples)
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 517, in train_valid_test_datasets_provider
[rank2]: train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 34, in build_train_valid_test_datasets
[rank2]: return _build_train_valid_test_datasets(data_prefix[0],
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 202, in _build_train_valid_test_datasets
[rank2]: train_dataset = build_dataset(0, 'train')
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 194, in build_dataset
[rank2]: dataset = GPTDataset(name, data_prefix, documents, indexed_dataset,
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 355, in __init__
[rank2]: _build_index_mappings(self.name, data_prefix,
[rank2]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 611, in _build_index_mappings
[rank2]: shuffle_idx = np.load(idx_path['shuffle'], allow_pickle=True, mmap_mode='r')
[rank2]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/numpy/lib/npyio.py", line 453, in load
[rank2]: return format.open_memmap(file, mode=mmap_mode,
[rank2]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/numpy/lib/format.py", line 945, in open_memmap
[rank2]: marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
[rank2]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/numpy/core/memmap.py", line 268, in __new__
[rank2]: mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
[rank2]: ValueError: mmap length is greater than file size
[2024-06-16 18:21:59][INFO][utils:246] - [0] > elasped time to build and save shuffle-idx mapping (seconds): 0.269242
[2024-06-16 18:21:59][INFO][utils:246] - [0] > loading doc-idx mapping from checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash/69a345d3a236b6174094c13f4f13d1b6_doc_idx.npy
[2024-06-16 18:21:59][INFO][utils:246] - [0] > loading sample-idx mapping from checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash/69a345d3a236b6174094c13f4f13d1b6_sample_idx.npy
[2024-06-16 18:21:59][INFO][utils:246] - [0] > loading shuffle-idx mapping from checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash/69a345d3a236b6174094c13f4f13d1b6_shuffle_idx.npy
[2024-06-16 18:21:59][INFO][utils:246] - [0] loaded indexed file in 0.017 seconds
[2024-06-16 18:21:59][INFO][utils:246] - [0] total number of samples: 10240307
[2024-06-16 18:21:59][INFO][utils:246] - [0] total number of epochs: 13
[2024-06-16 18:21:59][INFO][pretrain_gpt_alcf:531] - > finished creating GPT datasets ...
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
> building shuffle index with split [0, 10240306) and [10240306, 10240306) ...
[rank3]: Traceback (most recent call last):
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 635, in
[rank3]: model = main()
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 594, in main
[rank3]: model = pretrain(
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 233, in pretrain
[rank3]: = build_train_valid_test_data_iterators(
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 1695, in build_train_valid_test_data_iterators
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 1651, in build_train_valid_test_data_loaders
[rank3]: train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/training.py", line 1623, in build_train_valid_test_datasets
[rank3]: return build_train_valid_test_datasets_provider(train_val_test_num_samples)
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/pretrain_gpt_alcf.py", line 517, in train_valid_test_datasets_provider
[rank3]: train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 34, in build_train_valid_test_datasets
[rank3]: return _build_train_valid_test_datasets(data_prefix[0],
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 202, in _build_train_valid_test_datasets
[rank3]: train_dataset = build_dataset(0, 'train')
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 194, in build_dataset
[rank3]: dataset = GPTDataset(name, data_prefix, documents, indexed_dataset,
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 355, in __init__
[rank3]: _build_index_mappings(self.name, data_prefix,
[rank3]: File "/lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed-DistributedDataLoading/Megatron-DeepSpeed/megatron/data/gpt_dataset.py", line 611, in _build_index_mappings
[rank3]: shuffle_idx = np.load(idx_path['shuffle'], allow_pickle=True, mmap_mode='r')
[rank3]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/numpy/lib/npyio.py", line 436, in load
[rank3]: raise EOFError("No data left in file")
[rank3]: EOFError: No data left in file
```
2. [Take 2]: re-running immediately after (in the same session, with the exact same command), it then works fine (??);

log:

```bash
# [06:23:55 PM][foremans@x3004c0s37b1n0][…/Megatron-DeepSpeed][🌱 llam][!?][⏱ 2m28s] (2024-04-29)
$ PBS_O_WORKDIR=$(pwd) TOKENIZER_TYPE="gpt" NLAYERS=48 HIDDEN=1536 SEQ=2048 HEADS=24 GRAD_ACC_STEPS=4 NO_LLAMA=1 bash train_llama_alcf.sh
# ...clipped...
[2024-06-16 18:24:28][INFO][training:1634] - > building train, validation, and test datasets ...
[2024-06-16 18:24:28][INFO][training:1617] - > datasets target sizes (minimum size):
[2024-06-16 18:24:28][INFO][training:1618] - train: 10172544
[2024-06-16 18:24:28][INFO][training:1619] - validation: 2240
[2024-06-16 18:24:28][INFO][training:1620] - test: 320
[2024-06-16 18:24:28][INFO][pretrain_gpt_alcf:497] - > building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap... /eagle/argonne_tpc/foremans/projects/argonne-lcf/Megatron-DeepSpeed/dataset/BookCorpusDataset_text_document.bin
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.002635 seconds
number of documents: 17868
> dataset split:
train: document indices in [0, 17868) total of 17868 documents
validation: document indices in [17868, 17868) total of 0 documents
test: document indices in [17868, 17868) total of 0 documents
[2024-06-16 18:24:28][INFO][utils:246] - [0] > loading doc-idx mapping from checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash/69a345d3a236b6174094c13f4f13d1b6_doc_idx.npy
[2024-06-16 18:24:28][INFO][utils:246] - [0] > loading sample-idx mapping from checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash/69a345d3a236b6174094c13f4f13d1b6_sample_idx.npy
[2024-06-16 18:24:28][INFO][utils:246] - [0] > loading shuffle-idx mapping from checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash/69a345d3a236b6174094c13f4f13d1b6_shuffle_idx.npy
[2024-06-16 18:24:28][INFO][utils:246] - [0] loaded indexed file in 0.006 seconds
[2024-06-16 18:24:28][INFO][utils:246] - [0] total number of samples: 10240307
[2024-06-16 18:24:28][INFO][utils:246] - [0] total number of epochs: 13
[2024-06-16 18:24:28][INFO][pretrain_gpt_alcf:531] - > finished creating GPT datasets ...
[2024-06-16 18:24:28][INFO][training:73] - [after dataloaders are built] datetime=2024-06-16 18:24:28
[2024-06-16 18:24:28][INFO][training:259] - done with setup ...
[2024-06-16 18:24:28][INFO][training:264] - training ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (6407.55, 6424.35)
train/valid/test-data-iterators-setup ..........: (297.10, 435.80)
[2024-06-16 18:24:28][INFO][training:73] - [before the start of training step] datetime=2024-06-16 18:24:28
[2024-06-16 18:24:28,947] [INFO] [checkpointing.py:540:forward] Activation Checkpointing Information
[2024-06-16 18:24:28,947] [INFO] [checkpointing.py:541:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-06-16 18:24:28,947] [INFO] [checkpointing.py:542:forward] ----contiguous Memory Checkpointing False with 48 total layers
[2024-06-16 18:24:28,947] [INFO] [checkpointing.py:544:forward] ----Synchronization False
[2024-06-16 18:24:28,947] [INFO] [checkpointing.py:545:forward] ----Profiling time in checkpointing False
[2024-06-16 18:24:31,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 237.75 | optimizer_gradients: 9.40 | optimizer_step: 38.30
[2024-06-16 18:24:31,150] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[1.887433467970254e-08, 1.887433467970254e-08], mom=[(0.9, 0.999), (0.9, 0.999)]
[2024-06-16 18:24:31,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.20 | bwd_microstep: 1380.13 | bwd_inner_microstep: 883.55 | bwd_allreduce_microstep: 496.45 | step_microstep: 361.95
[2024-06-16 18:24:31,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 730.15 | bwd: 1380.12 | bwd_inner: 883.53 | bwd_allreduce: 496.46 | step: 361.95
[2024-06-16 18:24:31][INFO][training:1251] - iteration= 1/ 317892 | consumed_samples= 32 | consumed_tokens= 65536 | elapsed_time_per_iteration_ms=2500.0 | learning_rate=0.000000 | global_batch_size= 32 | lm loss=11.071705 | loss_scale=65536.0 |actual_seqlen= 2048 | number_of_skipped_iterations= 0 | number_of_nan_iterations= 0 | samples_per_second=12.800 | tokens_per_gpu_per_second_tgs=3276.803 | TFLOPs=45.06 |
[Rank 0] (after 1 iterations) memory (MB) | allocated: 4872.24609375 | max allocated: 8483.11669921875 | reserved: 11406.0 | max reserved: 11406.0
(min, max) time across ranks (ms):
forward-backward ...............................: (2124.88, 2125.45)
optimizer ......................................: (362.11, 362.61)
[2024-06-16 18:24:33,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 237.74 | optimizer_gradients: 3.40 | optimizer_step: 12.93
[2024-06-16 18:24:33,044] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[3.774866935940508e-08, 3.774866935940508e-08], mom=[(0.9, 0.999), (0.9, 0.999)]
[2024-06-16 18:24:33,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.94 | bwd_microstep: 1316.77 | bwd_inner_microstep: 821.20 | bwd_allreduce_microstep: 495.48 | step_microstep: 274.70
[2024-06-16 18:24:33,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 278.90 | bwd: 1316.78 | bwd_inner: 821.20 | bwd_allreduce: 495.49 | step: 274.70
[2024-06-16 18:24:33][INFO][training:1251] - iteration= 2/ 317892 | consumed_samples= 64 | consumed_tokens= 131072 | elapsed_time_per_iteration_ms=1893.9 | learning_rate=0.000000 | global_batch_size= 32 | lm loss=11.067770 | loss_scale=65536.0 |actual_seqlen= 2048 | number_of_skipped_iterations= 0 | number_of_nan_iterations= 0 | samples_per_second=16.896 | tokens_per_gpu_per_second_tgs=4325.475 | TFLOPs=59.48 |
(min, max) time across ranks (ms):
forward-backward ...............................: (1600.40, 1600.90)
optimizer ......................................: (274.85, 275.27)
```

🤷🏻‍♂️
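
fwiw, the failure mode (crash on the first run, clean on an immediate re-run) looks consistent with a race on the cached index maps (`*_doc_idx.npy` / `*_sample_idx.npy` / `*_shuffle_idx.npy`): in the Take 1 log, rank 0 is still building and saving them (the `shuffle-idx` "elasped time to build and save" line is printed *after* rank 2's `ValueError`), while the other ranks are already calling `np.load(idx_path['shuffle'], allow_pickle=True, mmap_mode='r')` in `gpt_dataset.py`, so a rank that mmaps a partially written file sees `mmap length is greater than file size` (or `EOFError: No data left in file` for an empty one). On the second run the maps already exist on disk, so every rank just loads them. This is only a guess from the logs, not a confirmed diagnosis; a quick pre-launch sanity check (hypothetical, not part of the repo):

```bash
# Hypothetical check, not in train_llama_alcf.sh: before a fresh launch, verify the
# cached GPT dataset index maps already exist and are non-empty, so no rank should
# mmap a file that is still being written. CKPT_DIR is the checkpoint directory
# that appears in the logs above.
CKPT_DIR="checkpoints/ws8_ds_stage1_nl48_hs1536_mb1_seq2048_gb32_sp1_pp1_tp1_fp16_optadamw_lr0.0003_lwf0.05_tokgpt_flash"
for f in "${CKPT_DIR}"/*_{doc,sample,shuffle}_idx.npy; do
  [[ -s "$f" ]] && echo "ok: $f ($(stat -c%s "$f") bytes)" || echo "missing/empty: $f"
done
```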

saforem2 commented 4 months ago

though, now that I'm thinking about it, I've also got uncommitted changes from @zhenghh04's PR (Distributed data loading #16) that might be impacting the behavior here (??); one way to rule them in or out is sketched below the `git status` output:

```bash
$ git status
On branch llama-toggle
Your branch is up to date with 'origin/llama-toggle'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   ALCF/test_blendable_dataset.py
    modified:   megatron/__init__.py
    modified:   megatron/data/blendable_dataset.py
    modified:   megatron/data/gpt_dataset.py
    modified:   megatron/model/transformer.py
    modified:   megatron/utils.py
```
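
(hypothetical check, not something I've run yet: stash the local edits, repeat the same command on a clean tree, then restore them)

```bash
# Hypothetical way to rule the local distributed-data-loading edits in or out:
# stash them, re-run the exact same command on a clean tree, then restore.
git stash push -m "distributed data loading edits (PR #16)"
PBS_O_WORKDIR=$(pwd) NO_LLAMA=1 TOKENIZER_TYPE="gpt" NLAYERS=48 HIDDEN=1536 SEQ=2048 HEADS=24 bash train_llama_alcf.sh
git stash pop
```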

at any rate, since it works, and we're not planning to target the GPT architecture explicitly in the future[^isolated], this seems good enough for now (??)

[^isolated]: and, as far as I can tell, this issue only seems to pop up when using the GPT2BPETokenizer with the BookCorpusDataset