Closed: chromecast56 closed this issue 1 month ago.
Do you get the same error when using the default/original c4 dataset?
I get a different error:
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/home/jamesliu/axolotl/src/axolotl/cli/train.py", line 73, in <module>
[rank0]: fire.Fire(do_cli)
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank0]: return do_train(parsed_cfg, parsed_cli_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/axolotl/src/axolotl/cli/train.py", line 68, in do_train
[rank0]: return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/axolotl/src/axolotl/train.py", line 170, in train
[rank0]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2230, in _inner_training_loop
[rank0]: for step, inputs in enumerate(epoch_iterator):
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/data_loader.py", line 677, in __iter__
[rank0]: next_batch, next_batch_info = self._fetch_batches(main_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jamesliu/anaconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/data_loader.py", line 635, in _fetch_batches
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: You can't use batches of different size with `dispatch_batches=True` or when using an `IterableDataset`.either pass `dispatch_batches=False` and have each process fetch its own batch or pass `split_batches=True`. By doing so, the main process will fetch a full batch and slice it into `num_processes` batches for each process.
```
@winglian As a bandaid, would there be a way for me to fully load the smaller pretrain dataset `jamesliu1/c4` using `datasets:` instead of `pretraining_dataset:` (my use case is mending after applying a slightly lossy compression to a base model)? The only issue is that there isn't a `pretrain` type in `datasets:`.
I'm facing the same issue!
For small pretrain-style datasets, you can use `type: completion`.
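A minimal sketch of what that could look like in the yml, assuming `jamesliu1/c4` exposes a plain `text` column (the `field` entry is an assumption, not confirmed against that dataset):

```yaml
datasets:
  - path: jamesliu1/c4   # the smaller C4 subset mentioned above
    type: completion     # treat each row as raw completion/pretraining text
    field: text          # assumed column name; completion defaults to `text`
```

Unlike `pretraining_dataset:`, this path fully loads and tokenizes the dataset up front rather than streaming it, which is what makes it a workable bandaid for small corpora.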
I had the same issue. On 06-Aug-2024 I was able to run with `type: completion`, but since 07-Aug-2024 I keep getting the error "CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call".
Going back to the commit from 06-Aug-2024 makes it work again:
```
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
git fetch origin
git checkout 203816f7b4de020c40708e4e61847b0716189380
```
Please check that this issue hasn't been reported before.
Expected Behavior
Should run without errors
Current behaviour
Steps to reproduce
Latest commit `78e12f8`, installed via pip/conda, `torch==2.3.1`, CUDA 12.2. Run command `accelerate launch -m axolotl.cli.train examples/tiny-llama/pretrain.yml` (slightly modified, see yml below). Running on an 8xH100 machine.
Config yaml
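For context, a rough, illustrative sketch of a pretraining config of this shape (these are not the reporter's exact settings; the dataset path is assumed to be swapped to `jamesliu1/c4`):

```yaml
# Illustrative sketch only -- not the exact yml from this report.
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
pretraining_dataset:
  path: jamesliu1/c4   # assumed: swapped in for the default c4 subset
max_steps: 200
sequence_len: 2048
micro_batch_size: 4
val_set_size: 0.0
output_dir: ./outputs/model-out
```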
Possible solution
Interestingly, running with an SFT dataset works fine (e.g., the commented-out tatsu-lab/alpaca). Not sure what the difference is in the pretraining case. Any help is appreciated!
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main/78e12f8
Acknowledgements