Open SimiPixel opened 1 month ago
Good question. Maybe the dataset is too small, so the validation set can't be generated. Does the same issue occur if you make the dataset larger, e.g. by duplicating the sentence?
I created the files ex1.txt and ex2.txt, which each contain 1000 lines with the sentence "Roses are always blue and the world is populated by roses."
The error is the same.
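(For reference, something like the following snippet produces such files; the script and the directory name are illustrative, not taken verbatim from the report.)

from pathlib import Path

sentence = "Roses are always blue and the world is populated by roses."
out_dir = Path("custom_texts_roses")  # hypothetical directory name
out_dir.mkdir(exist_ok=True)
for name in ("ex1.txt", "ex2.txt"):
    # 1000 copies of the sentence, one per line
    (out_dir / name).write_text("\n".join([sentence] * 1000) + "\n")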
Thanks, this definitely sounds like an issue worth looking into then.
After some initial investigation, I believe this is related to the fact that max_seq_length=-1.
Consider the following example:
from pathlib import Path

from litgpt.data import TextFiles
from litgpt.tokenizer import Tokenizer

# Tokenizer from a local checkpoint and the directory containing the custom text files
tokenizer = Tokenizer("/home/woody/iwb0/iwb0003h/checkpoints/meta-llama/Meta-Llama-3-8B-Instruct")
text_files = TextFiles(Path("/home/woody/iwb0/iwb0003h/custom_texts_roses"))

# max_seq_length=-1 is the setting that triggers the error below
text_files.connect(tokenizer, max_seq_length=-1)
text_files.prepare_data()
text_files.setup()

dl = text_files.val_dataloader()
sample = next(iter(dl))
This throws the following error:
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
Cell In[37], line 1
----> 1 len(dl)
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/torch/utils/data/dataloader.py:475, in DataLoader.__len__(self)
457 def __len__(self) -> int:
458 if self._dataset_kind == _DatasetKind.Iterable:
459 # NOTE [ IterableDataset and __len__ ]
460 #
(...)
473
474 # Cannot statically verify that dataset is Sized
--> 475 length = self._IterableDataset_len_called = len(self.dataset) # type: ignore[assignment, arg-type]
476 if self.batch_size is not None: # IterableDataset doesn't allow custom sampler or batch_sampler
477 from math import ceil
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/dataset.py:162, in StreamingDataset.__len__(self)
161 def __len__(self) -> int:
--> 162 return self.get_len(1, 1)
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/dataset.py:169, in StreamingDataset.get_len(self, num_workers, batch_size)
167 worker_env = _WorkerEnv.detect()
168 if self.shuffler is None:
--> 169 cache = self._create_cache(worker_env=worker_env)
170 self.shuffler = self._create_shuffler(cache)
171 return self.shuffler.get_len(self.distributed_env, num_workers, batch_size, self.current_epoch)
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/dataset.py:142, in StreamingDataset._create_cache(self, worker_env)
133 self.input_dir.path = cache_path
135 cache = Cache(
136 input_dir=self.input_dir,
137 item_loader=self.item_loader,
(...)
140 max_cache_size=self.max_cache_size,
141 )
--> 142 cache._reader._try_load_config()
144 if not cache.filled:
145 raise ValueError(
146 f"The provided dataset `{self.input_dir}` doesn't contain any {_INDEX_FILENAME} file."
147 " HINT: Did you successfully optimize a dataset to the provided `input_dir`?"
148 )
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/reader.py:211, in BinaryReader._try_load_config(self)
209 def _try_load_config(self) -> Optional[ChunksConfig]:
210 """Try to load the chunks config if the index files are available."""
--> 211 self._config = ChunksConfig.load(self._cache_dir, self._serializers, self._remote_input_dir, self._item_loader)
212 return self._config
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/config.py:213, in ChunksConfig.load(cls, cache_dir, serializers, remote_dir, item_loader)
210 if not os.path.exists(cache_index_filepath):
211 return None
--> 213 return ChunksConfig(cache_dir, serializers, remote_dir, item_loader)
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/config.py:61, in ChunksConfig.__init__(self, cache_dir, serializers, remote_dir, item_loader)
58 self._config["data_spec"] = treespec_loads(self._config["data_spec"])
60 self._item_loader.setup(self._config, self._chunks, serializers)
---> 61 self._intervals = self._item_loader.generate_intervals()
62 self._length = self._intervals[-1][-1]
63 self._downloader = None
File /home/woody/iwb0/iwb0003h/.conda/envs/litgpt/lib/python3.11/site-packages/litdata/streaming/item_loader.py:177, in TokensLoader.generate_intervals(self)
175 for chunk in self._chunks:
176 dim = chunk["dim"]
--> 177 num_blocks = dim // self._block_size
178 end += num_blocks
179 intervals.append((begin, end))
ZeroDivisionError: integer division or modulo by zero
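In the last frame, the divisor self._block_size is apparently 0. My assumption, based only on the traceback and not verified against the litgpt source, is that connect(max_seq_length=-1) ends up as a TokensLoader block size of max_seq_length + 1 = 0, which would reduce the failure to:

# Assumed failure path, reconstructed from the traceback (not the actual litgpt code)
max_seq_length = -1
block_size = max_seq_length + 1   # 0 under this assumption
dim = 1000                        # illustrative token count of a chunk
num_blocks = dim // block_size    # ZeroDivisionError: integer division or modulo by zero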
However, when setting max_seq_length to e.g. 10, the code works just fine.
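For completeness, the working variant only differs in the connect() call (using the value 10 mentioned above):

text_files.connect(tokenizer, max_seq_length=10)  # positive value instead of -1
text_files.prepare_data()
text_files.setup()
dl = text_files.val_dataloader()
sample = next(iter(dl))  # no ZeroDivisionError

Presumably any positive max_seq_length avoids the zero block size in TokensLoader.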
The example (https://github.com/Lightning-AI/litgpt?tab=readme-ov-file#continue-pretraining-an-llm) works fine on my machine, but as soon as I replace the provided data with custom text files that each contain just one English sentence with no special characters, the example no longer works.
I don't understand what the difference is between the provided example data and my custom data. Is there some special formatting that I am not seeing? Running the example with my files results in the same ZeroDivisionError shown in the traceback above.