mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt
# 1) Download a tokenizer
litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True
# 2) Pretrain the model
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model
# 3) Test the model
litgpt chat out/custom-model/final
Running the example above on a non-Studio machine results in the following error:
litgpt pretrain EleutherAI/pythia-160m --tokenizer_dir EleutherAI/pythia-160m --data TextFiles --data.train_data_path /home/sebastian/custom_texts/ --train.max_tokens 10_000 --out_dir out/custom-model
uvloop is not installed. Falling back to the default asyncio event loop.
/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.10 /home/sebastian/miniforge3/envs/litgp2/bin/litgp ...
Using bfloat16 Automatic Mixed Precision (AMP)
{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 1,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('/home/sebastian/custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'final_validation': True,
          'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': None,
 'logger_name': 'tensorboard',
 'model_config': {'attention_logit_softcapping': None,
                  'attention_scores_scalar': None,
                  'bias': True,
                  'block_size': 2048,
                  'final_logit_softcapping': None,
                  'gelu_approximate': 'none',
                  'head_size': 64,
                  'hf_config': {'name': 'pythia-160m', 'org': 'EleutherAI'},
                  'intermediate_size': 3072,
                  'lm_head_bias': False,
                  'mlp_class_name': 'GptNeoxMLP',
                  'n_embd': 768,
                  'n_expert': 0,
                  'n_expert_per_token': 0,
                  'n_head': 12,
                  'n_layer': 12,
                  'n_query_groups': 12,
                  'name': 'pythia-160m',
                  'norm_class_name': 'LayerNorm',
                  'norm_eps': 1e-05,
                  'padded_vocab_size': 50304,
                  'padding_multiple': 128,
                  'parallel_residual': True,
                  'post_attention_norm': False,
                  'post_mlp_norm': False,
                  'rope_base': 10000,
                  'rope_condense_ratio': 1,
                  'rotary_percentage': 0.25,
                  'scale_embeddings': False,
                  'shared_attention_norm': False,
                  'sliding_window_layer_placing': None,
                  'sliding_window_size': None,
                  'vocab_size': 50254},
 'model_name': 'EleutherAI/pythia-160m',
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('checkpoints/EleutherAI/pythia-160m'),
 'train': {'epochs': None,
           'global_batch_size': 512,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 10000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False}}
Seed set to 42
Time to instantiate model: 1.23 seconds.
Total parameters: 162,322,944
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Storing the files under /home/sebastian/custom_texts/train
Setup started with fast_dev_run=False.
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format. | 0/1 [00:00<?, ?it/s]
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.47it/s]
Workers are finished.
Traceback (most recent call last):
  File "/home/sebastian/miniforge3/envs/litgp2/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/sebastian/test-litgpt/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 154, in setup
    main(
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 214, in main
    train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
  File "/home/sebastian/test-litgpt/litgpt/litgpt/pretrain.py", line 409, in get_dataloaders
    data.prepare_data()
  File "/home/sebastian/test-litgpt/litgpt/litgpt/data/text_files.py", line 72, in prepare_data
    optimize(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/functions.py", line 375, in optimize
    data_processor.run(
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 1016, in run
    result = data_recipe._done(len(user_items), self.delete_cached_files, self.output_dir)
  File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 736, in _done
    raise RuntimeError(f"All the chunks should have been deleted. Found {chunks}")
RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin']
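For anyone trying to reproduce this, a small diagnostic sketch for listing `chunk-*.bin` files in a directory may help narrow down where the leftover chunk lives. This is only a sketch: the helper name `find_leftover_chunks` is made up for illustration, and the traceback does not show which directory LitData inspects, so the path used below (the output directory from the "Storing the files under ..." log line) is an assumption.

```python
from pathlib import Path


def find_leftover_chunks(directory):
    """Return the sorted names of chunk-*.bin files directly inside `directory`.

    Hypothetical helper for diagnosis only; it does not call into LitData.
    Returns an empty list if the directory does not exist.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("chunk-*.bin"))


if __name__ == "__main__":
    # Assumption: this is the output directory named in the log above.
    # The directory LitData actually checks may be a different (cache) location.
    train_dir = Path("/home/sebastian/custom_texts/train")
    print(find_leftover_chunks(train_dir))
```

Running it against both the output directory and any temporary cache directory before and after a failed run would show whether the chunk was left behind by an earlier attempt or by the current one.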
What operating system are you using?
Linux
LitGPT Version
Latest LitGPT from main; reproduced with both LitData 0.2.17 and the latest release, 0.2.26
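Since the failure was observed with more than one LitData version, it can be useful to confirm which versions are actually active in the environment when reproducing. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution names `litgpt` and `litdata` are assumed to match the installed package names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; adjust if the packages are installed under other names.
    for name in ("litgpt", "litdata"):
        print(name, installed_version(name))
```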
Bug description
Running the pretraining example (the commands at the top of this report) on a non-Studio machine fails with the `RuntimeError` shown in the traceback above.