EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Negative document indices caused by a 64-bit integer stored in a 32-bit integer array. #493

Closed pwstegman closed 1 year ago

pwstegman commented 2 years ago

Describe the bug

While training on The Pile, I was getting errors from sparse attention claiming that the sequence length wasn't divisible by the block size, despite using a sequence length of 8192 and a block size of 16. The root cause was negative document indices in the dataset, which produced irregular sample lengths (screenshot included in the Screenshots section). The negative document indices were caused by a wraparound at the 32-bit signed integer limit. More details are in the Proposed Solution section.
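As a standalone illustration (this is not code from the repo), narrowing a 64-bit index that has passed INT32_MAX into a 32-bit integer wraps it to a large negative value:

#include <cstdint>
#include <iostream>

int main() {
    // One past the 32-bit signed maximum, e.g. a document index after many epochs.
    int64_t doc_idx = static_cast<int64_t>(INT32_MAX) + 1;   // 2147483648
    // Narrowing to 32 bits wraps around on two's-complement targets.
    int32_t stored = static_cast<int32_t>(doc_idx);          // -2147483648
    std::cout << doc_idx << " -> " << stored << "\n";
    return 0;
}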

To Reproduce

I'm using the docker image leogao2/gpt-neox:sha-6dc7645. My training script is:

#!/bin/bash
python deepy.py train.py --conf_dir configs local_setup.yml sparse.yml 13B.yml

Configs are included in the Environment section.

Expected behavior

Each sample should be exactly 8193 tokens (the 8192-token sequence plus one extra token, so that inputs and targets can be offset by one).

Proposed solution

In short, I traced the issue to this function: https://github.com/EleutherAI/gpt-neox/blob/98683aee2a697027002ea9c907bc160dbaf2539a/megatron/data/helpers.cpp#L100

It keeps looping until the target number of samples is reached:

https://github.com/EleutherAI/gpt-neox/blob/98683aee2a697027002ea9c907bc160dbaf2539a/megatron/data/helpers.cpp#L145

There are only ~200M documents, but since the loop covers multiple epochs, and there may be multiple documents per sample, the running document index quickly exceeds the 32-bit signed integer maximum of 2,147,483,647 (for example, eleven passes over ~200M documents is already ~2.2 billion). The document index variable itself is a 64-bit signed integer:

https://github.com/EleutherAI/gpt-neox/blob/98683aee2a697027002ea9c907bc160dbaf2539a/megatron/data/helpers.cpp#L137

However, it is stored in a 32-bit signed integer array, and that narrowing is where the wraparound to negative values happens:

https://github.com/EleutherAI/gpt-neox/blob/98683aee2a697027002ea9c907bc160dbaf2539a/megatron/data/helpers.cpp#L122 https://github.com/EleutherAI/gpt-neox/blob/98683aee2a697027002ea9c907bc160dbaf2539a/megatron/data/helpers.cpp#L168
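In paraphrased form, the problematic pattern looks roughly like this (a sketch with made-up names, not the actual helpers.cpp code):

#include <cstdint>

// Paraphrased sketch: the running document index is 64-bit,
// but each value is written into a 32-bit output buffer.
void fill_doc_idx(int32_t* doc_idx_out, int64_t num_samples) {
    int64_t doc_index = 0;                   // keeps growing across epochs, never wraps itself
    for (int64_t i = 0; i < num_samples; ++i) {
        doc_idx_out[i] = doc_index;          // implicit narrowing: goes negative past INT32_MAX
        doc_index += 1;                      // stand-in for advancing by the documents in this sample
    }
}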

To solve this, I propose:

  1. Update the array to a 64-bit signed integer array.
  2. Reduce the document index modulo the number of documents before storing it in the array.

I can submit a PR to take care of both of these if this sounds reasonable.
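For concreteness, here is a rough sketch of what the two changes could look like together (hypothetical code with made-up names, not the eventual patch):

#include <cstdint>

// Hypothetical sketch of both proposed changes: (1) the output buffer is 64-bit,
// and (2) the index is reduced modulo the number of documents before being stored,
// so it always refers to a real document no matter how many epochs the loop covers.
void fill_doc_idx_fixed(int64_t* doc_idx_out, int64_t num_samples, int64_t num_docs) {
    int64_t doc_index = 0;
    for (int64_t i = 0; i < num_samples; ++i) {
        doc_idx_out[i] = doc_index % num_docs;   // stays within [0, num_docs)
        doc_index += 1;                          // stand-in for the real advance logic
    }
}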

Screenshots

Here's one screenshot I took which highlights the core of the issue:

[screenshot: sample data showing negative document indices and the resulting irregular sample lengths]

Environment (please complete the following information):

13B.yml

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 4,

   # model settings
   "num-layers": 40,
   "hidden-size": 5120,
   "num-attention-heads": 40,
   "seq-length": 8192,
   "max-position-embeddings": 8192,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": true,
   "bias-gelu-fusion": true,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0000001,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },

   # ZeRO
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "round_robin_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 1,
   "data-impl": "mmap",
   "split": "949,50,1",
   "gradient_accumulation_steps": 128,

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.1,
   "hidden-dropout": 0.1,
   "attention-dropout": 0.1,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 3200000,
   "lr-decay-iters": 3200000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.00004,
   "min_lr": 0.00000001,
   "save-interval": 80,
   "eval-interval": 20,
   "eval-iters": 1,

   # logging
   "log-interval": 1,
   "steps_per_print": 1,
   "keep-last-n-checkpoints": 6,
   "wall_clock_breakdown": true,

   "override_lr_scheduler": true,
   "use_checkpoint_lr_scheduler": false,
   "finetune": true
}

local_setup.yml

{
  "data-path": "/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document",

  "vocab-file": "/mnt/4TBNVME/data/gpt2-vocab.json",
  "merge-file": "/mnt/4TBNVME/data/gpt2-merges.txt",

  "save": "/mnt/4TBNVME/checkpoints",
  "load": "/mnt/4TBNVME/checkpoints",
  "checkpoint_validation_with_forward_pass": False,

  "tensorboard-dir": "/mnt/4TBNVME/tensorboard_logs/run19",
  "log-dir": "/mnt/4TBNVME/gptneox_logs/run19",
  "use_wandb": False,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox"
}

sparse.yml

# Add this to your config for sparse attention every other layer
{
  "attention_config": [[["local", "local"], "all"]],

  # sparsity config:
  # (these are the defaults for local sliding window sparsity, training will work without this here, but it's left in for
  # illustrative purposes)
  # see https://www.deepspeed.ai/tutorials/sparse-attention/#how-to-config-sparsity-structures for
  # more detailed config instructions and available parameters

  "sparsity_config": {
    "block": 16, # block size
    "num_local_blocks": 32,
  }
}

Additional context

None

StellaAthena commented 2 years ago

This seems like a generally good idea, though I'm very intrigued by some of your config choices. You're finetuning a 13B-parameter model with a sequence length of 8192? And doing more than 10 epochs on the Pile?

pwstegman commented 2 years ago

Great, I'll get to work on a PR! Realistically I won't be able to train for 10 epochs; I was just changing parameters in the config and running small tests, and I stumbled across this bug by accident. I am curious whether the model can learn to process long-form text, though, hence the 8192 sequence length.

Unrelated, I noticed that model checkpoints are stored in a way that is specific to the 3D parallelism config. Is it possible to take a checkpoint that used a model parallelism of 4 and update it to a model parallelism of 8? I was thinking it should be possible to write a conversion script that copies all the weights over into the right locations, but wasn't sure if something like that already existed.

StellaAthena commented 2 years ago

That is a functionality we are currently exploring. It is unfortunately non-trivial :/

StellaAthena commented 1 year ago

@pwstegman did you ever solve this issue?

StellaAthena commented 1 year ago

@haileyschoelkopf @ShivanshuPurohit @Quentin-Anthony this was the issue that you independently discovered and then patched right?

Quentin-Anthony commented 1 year ago

Yeah this should be fixed by https://github.com/EleutherAI/gpt-neox/pull/835