Closed cabal-daniel closed 11 months ago
I also tried setting micro_batch_size to 1.
Does it work on a single GPU? In my experience, the "RuntimeError: generator raised StopIteration" error usually meant I had passed it the wrong data folder.
Yeah, actually I found that the fix when running against the sample was to use only the Common Crawl dataset. I was passing in the right folder. Closing the issue...
How did we end up resolving this? @cabal-daniel @rasbt
Hi, I ran into the same problem with RedPajama-sample datasets. Could you please tell me how did you solve the problem? @cabal-daniel
Hi, if you look at the code in lit_llama/packed_dataset.py, you will notice that the sample datasets only have 12 bin files. If you set the number of devices to 4 (the default), each device gets only 3 bin files. The check "if self._n_chunks > len(self._filenames[self._file_idx:]):" then triggers, since 4 > 3 with the defaults, and raises the error. If you set the number of devices to 2, there is no problem.
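To make the arithmetic concrete, here is a minimal sketch (not the actual lit-llama code) of why the sample set fails: the 12 bin files are split across the devices, and the quoted guard fails whenever a device is left with fewer files than n_chunks. The helper name and the even-split assumption are mine, for illustration only:

```python
def has_enough_chunks(total_files: int, num_devices: int, n_chunks: int = 4) -> bool:
    """Return True if each device has at least n_chunks bin files available.

    Mirrors the spirit of the guard
    `if self._n_chunks > len(self._filenames[self._file_idx:])`
    in lit_llama/packed_dataset.py, assuming files are split evenly.
    """
    files_per_device = total_files // num_devices
    return n_chunks <= files_per_device

# RedPajama sample: 12 bin files total
print(has_enough_chunks(12, num_devices=4))  # 3 files/device < 4 chunks -> False (error)
print(has_enough_chunks(12, num_devices=2))  # 6 files/device >= 4 chunks -> True (ok)
```

So with the sample data, either drop to 2 devices or reduce n_chunks so that it does not exceed the per-device file count.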
Running the pre-training script against the RedPajama sample with 4 A100 80GBs on a single node. Per the advice given in this issue: https://github.com/Lightning-AI/lit-llama/issues/301, I reduced max_iters to 1 and this error is still here. Any ideas?