Lightning-AI / litgpt

20+ high-performance LLM implementations with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Using custom data for `Continue pretraining an LLM` #1450

Open SimiPixel opened 1 month ago

SimiPixel commented 1 month ago

The example (https://github.com/Lightning-AI/litgpt?tab=readme-ov-file#continue-pretraining-an-llm) works fine on my machine, but as soon as I replace the data with custom text files that each contain just one English sentence with no special characters, the example no longer works.

I don't understand what the difference is between the provided example data and my custom data. Is there some special formatting that I am not seeing?

litgpt pretrain \
  --model_name Meta-Llama-3-8B-Instruct \
  --tokenizer_dir $WORK/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --initial_checkpoint_dir $WORK/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --data TextFiles \
  --data.train_data_path "/data/custom_texts/" \
  --train.max_tokens 100_000 \
  --out_dir $WORK/out/custom-model

which results in

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
[rank: 3] Seed set to 42
[rank: 2] Seed set to 42
[rank: 1] Seed set to 42
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

{'data': {'batch_size': 1,
          'max_seq_length': -1,
          'num_workers': 4,
          'seed': 42,
          'tokenizer': None,
          'train_data_path': PosixPath('/home/woody/iwb0/iwb0003h/custom_texts'),
          'val_data_path': None},
 'devices': 'auto',
 'eval': {'initial_validation': False,
          'interval': 1000,
          'max_iters': 100,
          'max_new_tokens': None},
 'initial_checkpoint_dir': PosixPath('/home/woody/iwb0/iwb0003h/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct'),
 'logger_name': 'tensorboard',
 'model_config': None,
 'model_name': 'Meta-Llama-3-8B-Instruct',
 'out_dir': PosixPath('/home/woody/iwb0/iwb0003h/out/custom-model'),
 'precision': None,
 'resume': False,
 'seed': 42,
 'tokenizer_dir': PosixPath('/home/woody/iwb0/iwb0003h/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct'),
 'train': {'beta1': 0.9,
           'beta2': 0.95,
           'epochs': None,
           'global_batch_size': 512,
           'learning_rate': 0.0004,
           'log_interval': 1,
           'lr_warmup_fraction': None,
           'lr_warmup_steps': 2000,
           'max_norm': 1.0,
           'max_seq_length': None,
           'max_steps': None,
           'max_tokens': 100000,
           'micro_batch_size': 4,
           'min_lr': 4e-05,
           'save_interval': 1000,
           'tie_embeddings': False,
           'weight_decay': 0.1}}
[rank: 0] Seed set to 42
Time to instantiate model: 0.03 seconds.
Total parameters: 8,030,261,248
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Storing the files under /home/woody/iwb0/iwb0003h/custom_texts/train
Setup started with fast_dev_run=False.
Worker 0 gets 0.0 MB (1 files)
Setup finished in 0.002 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.68it/s]
Workers are finished.
Finished data processing!
Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Storing the files under /home/woody/iwb0/iwb0003h/custom_texts/val
Setup started with fast_dev_run=False.
Worker 0 gets 0.0 MB (1 files)
Setup finished in 0.001 seconds. Found 1 items to process.
Starting 1 workers with 1 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...
Rank 0 inferred the following `['no_header_tensor:16']` data format.
Worker 0 is terminating.
Worker 0 is done.
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.69it/s]
Workers are finished.
Finished data processing!
Validating ...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/bin/litgpt", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litgpt/__main__.py", line 143, in main
[rank0]:     fn(**kwargs)
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litgpt/pretrain.py", line 123, in setup
[rank0]:     main(
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litgpt/pretrain.py", line 207, in main
[rank0]:     fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litgpt/pretrain.py", line 235, in fit
[rank0]:     validate(fabric, model, val_dataloader, max_iters=2)   # sanity check
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litgpt/pretrain.py", line 362, in validate
[rank0]:     val_loss = torch.stack(losses).mean()
[rank0]:                ^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: stack expects a non-empty TensorList
[ranks 1, 2, and 3 fail with the identical traceback]
rasbt commented 1 month ago

Good question. Maybe the dataset is too small, so the validation set can't be generated. Does the same issue occur if you make the dataset larger, e.g., by duplicating the sentence?

SimiPixel commented 1 month ago

I created the files ex1.txt and ex2.txt, each containing 1000 lines with the sentence "Roses are always blue and the world is populated by roses."

The error is the same.
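
For reference, a minimal sketch of how such test files can be generated (the output directory here is just an illustrative placeholder):

from pathlib import Path

# Write two text files, each containing the same sentence 1000 times,
# into the directory that will be passed to --data.train_data_path
out_dir = Path("/data/custom_texts")  # placeholder path
out_dir.mkdir(parents=True, exist_ok=True)
sentence = "Roses are always blue and the world is populated by roses.\n"
for name in ("ex1.txt", "ex2.txt"):
    (out_dir / name).write_text(sentence * 1000)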

rasbt commented 1 month ago

Thanks, this definitely sounds like an issue to look into.

SimiPixel commented 1 month ago

After some initial investigation, I believe this is related to the fact that max_seq_length=-1.

Consider the following example:

from litgpt.data import TextFiles
from pathlib import Path
from litgpt.tokenizer import Tokenizer

# Tokenizer loaded from the downloaded Meta-Llama-3-8B-Instruct checkpoint directory
tokenizer = Tokenizer("/home/woody/iwb0/iwb0003h/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct")

# Directory containing the ex1.txt / ex2.txt files described above
text_files = TextFiles(Path("/home/woody/iwb0/iwb0003h/custom_texts_roses"))

# max_seq_length=-1 matches the 'data' section of the config dump above
text_files.connect(tokenizer, max_seq_length=-1)
text_files.prepare_data()
text_files.setup()

dl = text_files.val_dataloader()
sample = next(iter(dl))

This throws the error below (the traceback is from calling len(dl) on the same dataloader):

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[37], line 1
----> 1 len(dl)

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/torch/utils/data/dataloader.py:475, in DataLoader.__len__(self)
    457 def __len__(self) -> int:
    458     if self._dataset_kind == _DatasetKind.Iterable:
    459         # NOTE [ IterableDataset and __len__ ]
    460         #
   (...)
    473 
    474         # Cannot statically verify that dataset is Sized
--> 475         length = self._IterableDataset_len_called = len(self.dataset)  # type: ignore[assignment, arg-type]
    476         if self.batch_size is not None:  # IterableDataset doesn't allow custom sampler or batch_sampler
    477             from math import ceil

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/dataset.py:162, in StreamingDataset.__len__(self)
    161 def __len__(self) -> int:
--> 162     return self.get_len(1, 1)

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/dataset.py:169, in StreamingDataset.get_len(self, num_workers, batch_size)
    167 worker_env = _WorkerEnv.detect()
    168 if self.shuffler is None:
--> 169     cache = self._create_cache(worker_env=worker_env)
    170     self.shuffler = self._create_shuffler(cache)
    171 return self.shuffler.get_len(self.distributed_env, num_workers, batch_size, self.current_epoch)

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/dataset.py:142, in StreamingDataset._create_cache(self, worker_env)
    133         self.input_dir.path = cache_path
    135 cache = Cache(
    136     input_dir=self.input_dir,
    137     item_loader=self.item_loader,
   (...)
    140     max_cache_size=self.max_cache_size,
    141 )
--> 142 cache._reader._try_load_config()
    144 if not cache.filled:
    145     raise ValueError(
    146         f"The provided dataset `{self.input_dir}` doesn't contain any {_INDEX_FILENAME} file."
    147         " HINT: Did you successfully optimize a dataset to the provided `input_dir`?"
    148     )

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/reader.py:211, in BinaryReader._try_load_config(self)
    209 def _try_load_config(self) -> Optional[ChunksConfig]:
    210     """Try to load the chunks config if the index files are available."""
--> 211     self._config = ChunksConfig.load(self._cache_dir, self._serializers, self._remote_input_dir, self._item_loader)
    212     return self._config

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/config.py:213, in ChunksConfig.load(cls, cache_dir, serializers, remote_dir, item_loader)
    210 if not os.path.exists(cache_index_filepath):
    211     return None
--> 213 return ChunksConfig(cache_dir, serializers, remote_dir, item_loader)

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/config.py:61, in ChunksConfig.__init__(self, cache_dir, serializers, remote_dir, item_loader)
     58 self._config["data_spec"] = treespec_loads(self._config["data_spec"])
     60 self._item_loader.setup(self._config, self._chunks, serializers)
---> 61 self._intervals = self._item_loader.generate_intervals()
     62 self._length = self._intervals[-1][-1]
     63 self._downloader = None

File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/item_loader.py:177, in TokensLoader.generate_intervals(self)
    175 for chunk in self._chunks:
    176     dim = chunk["dim"]
--> 177     num_blocks = dim // self._block_size
    178     end += num_blocks
    179     intervals.append((begin, end))

ZeroDivisionError: integer division or modulo by zero
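
The last frame suggests self._block_size is 0 here. A minimal sketch of the suspected arithmetic, assuming (not verified against the source) that the TokensLoader block size is derived as max_seq_length + 1:

max_seq_length = -1
block_size = max_seq_length + 1  # assumed derivation -> 0
dim = 12_000                     # illustrative token count stored in a chunk
num_blocks = dim // block_size   # ZeroDivisionError: integer division or modulo by zero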

However, setting max_seq_length to, e.g., 10 makes the code work just fine.
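
If that is indeed the cause, a possible workaround until this is fixed would be to pass an explicit sequence length on the command line. Assuming the pretrain script forwards train.max_seq_length (shown as None in the config dump above) to the data module's connect() call, something like:

litgpt pretrain \
  --model_name Meta-Llama-3-8B-Instruct \
  --tokenizer_dir $WORK/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --initial_checkpoint_dir $WORK/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct \
  --data TextFiles \
  --data.train_data_path "/data/custom_texts/" \
  --train.max_tokens 100_000 \
  --train.max_seq_length 512 \
  --out_dir $WORK/out/custom-model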