Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

"RuntimeError: All the chunks should have been deleted." on non-Studio machine #1716

Open rasbt opened 1 month ago

rasbt commented 1 month ago

Bug description

When running the pretraining example:

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model

# 3) Test the model
litgpt chat out/custom-model/final

on a non-Studio machine, it fails with the error below.

litgpt pretrain EleutherAI/pythia-160m   --tokenizer_dir EleutherAI/pythia-160m   --data TextFiles   --data.train_data_path /home/sebastian/custom_texts/   --train.max_tokens 10_000   --out_dir out/custom-model
uvloop is not installed. Falling back to the default asyncio event loop.
/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.10 /home/sebastian/miniforge3/envs/litgp2/bin/litgp ...
Using bfloat16 Automatic Mixed Precision (AMP)
{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 1,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('/home/sebastian/custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'final_validation': True,
          'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': None,
 'logger_name': 'tensorboard',
 'model_config': {'attention_logit_softcapping': None,
                  'attention_scores_scalar': None,
                  'bias': True,
                  'block_size': 2048,
                  'final_logit_softcapping': None,
                  'gelu_approximate': 'none',
                  'head_size': 64,
                  'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
                  'intermediate_size': 3072,
                  'lm_head_bias': False,
                  'mlp_class_name': 'GptNeoxMLP',
                  'n_embd': 768,
                  'n_expert': 0,
                  'n_expert_per_token': 0,
                  'n_head': 12,
                  'n_layer': 12,
                  'n_query_groups': 12,
                  'name': 'pythia-160m',
                  'norm_class_name': 'LayerNorm',
                  'norm_eps': 1e-05,
                  'padded_vocab_size': 50304,
                  'padding_multiple': 128,
                  'parallel_residual': True,
                  'post_attention_norm': False,
                  'post_mlp_norm': False,
                  'rope_base': 10000,
                  'rope_condense_ratio': 1,
                  'rotary_percentage': 0.25,
                  'scale_embeddings': False,
                  'shared_attention_norm': False,
                  'sliding_window_layer_placing': None,
                  'sliding_window_size': None,
                  'vocab_size': 50254},
 'model_name': 'EleutherAI/pythia-160m',
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
 'train': {'epochs': None,
           'global_batch_size': 512,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 10000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False}}
Seed set to 42
Time to instantiate model: 1.23 seconds.
Total parameters: 162,322,944
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Storing the files under /home/sebastian/custom_texts/train
Setup started with fast_dev_run=False.
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.47it/s]
Workers are finished.
Traceback (most recent call last):
  File "/home/sebastian/miniforge3/envs/litgp2/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/sebastian/test-litgpt/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 154, in setup
    main(
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 409, in get_dataloaders
    data.prepare_data()
  File "/home/sebastian/test-litgpt/litgpt/litgpt/data/text_files.py", line 72, in prepare_data
    optimize(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/functions.py", line 375, in optimize
    data_processor.run(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 1016, in run
    result = data_recipe._done(len(user_items), self.delete_cached_files, self.output_dir)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 736, in _done
    raise RuntimeError(f"All the chunks should have been deleted. Found {chunks}")
RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin']

What operating system are you using?

Linux

LitGPT Version

Latest LitGPT version on main, reproduced with both LitData 0.2.17 and the latest 0.2.26

rasbt commented 1 month ago

Might be a LitData bug. I reported it there with a smaller, self-contained example that doesn't use LitGPT: https://github.com/Lightning-AI/litdata/issues/367
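
For context, a LitGPT-free reproduction of this code path only needs litdata's `optimize` API. The sketch below is illustrative only: the fake tokenizer, file paths, and chunk settings are placeholders, not the exact example posted in the linked litdata issue.

import torch
from litdata import optimize


def fake_tokenize(filepath: str):
    # Stand-in for LitGPT's tokenizer: yield one small int16 tensor per line,
    # mimicking the `no_header_tensor:16` format seen in the log above.
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            n = max(len(line.split()), 1)
            yield torch.randint(0, 32000, (n,), dtype=torch.int16)


if __name__ == "__main__":
    # Mirrors the layout LitGPT uses ("Storing the files under .../custom_texts/train").
    optimize(
        fn=fake_tokenize,
        inputs=["custom_texts/book1.txt"],  # any local text file
        output_dir="custom_texts/train",
        chunk_bytes="64MB",
        num_workers=1,
    )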