ebektas opened 1 month ago
I suspect there are some parameters you can pass with `--data.*`, but `litgpt pretrain -h` doesn't show them.
I think that's currently not possible, but I'm in favor of adding it as a dataset config, and maybe having the default be `num_workers="auto"`. Since you have much more experience with the pretraining code, @awaelchli, what do you think? I'd be happy to add that.
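A `num_workers="auto"` default could be resolved once at construction time. A minimal sketch of the idea; the `resolve_num_workers` helper and the `"auto"` sentinel are hypothetical, not existing litgpt API:

```python
import os
from typing import Union


def resolve_num_workers(num_workers: Union[int, str] = "auto") -> int:
    """Resolve a DataLoader worker count, treating "auto" as (CPU count - 1).

    Hypothetical helper illustrating the "auto" default discussed above.
    """
    if num_workers == "auto":
        # Leave one core free for the main process.
        return max(1, (os.cpu_count() or 1) - 1)
    if isinstance(num_workers, int) and num_workers >= 0:
        # 0 is valid: PyTorch loads data in the main process.
        return num_workers
    raise ValueError(f"Invalid num_workers: {num_workers!r}")
```

An explicit integer always wins over the auto-detected value, so users keep exact control.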
@ebektas Does the issue occur when you use `litgpt pretrain ...`, or is this an issue you encounter when preparing the dataset, e.g., `python litgpt/data/prepare_slimpajama.py ...`?
I am mainly asking because it looks like we already expose the number of workers for the pretraining itself:
```
⚡ main ~/litgpt litgpt pretrain --data.help litgpt.data.TinyLlama
usage: litgpt [--data.init_args.data_path DATA_PATH] [--data.init_args.seed SEED]
              [--data.init_args.num_workers NUM_WORKERS] [--data.init_args.use_starcoder {true,false}]

The TinyLlama data module is composed of a mix of SlimPajama and Starcoder data:

  --data.init_args.data_path DATA_PATH
                        The path to the data directory, containing two folders 'slimpajama' and
                        'starcoder' which are the output of the preprocessing step done in advance.
                        See the `tutorial/pretrain_tinyllama.md` for instructions. The path can also
                        be a remote path (e.g., s3://). (type: Union[str, Path], default: data)
  --data.init_args.seed SEED
                        The random seed for shuffling the dataset. (type: int, default: 42)
  --data.init_args.num_workers NUM_WORKERS
                        How many DataLoader processes to use for loading. (type: int, default: 8)
  --data.init_args.use_starcoder {true,false}
                        Toggle for using Starcoder data. (type: bool, default: True)
```
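So the worker count can already be set for pretraining itself, either on the command line via `--data.init_args.num_workers`, or in a YAML config along these lines (a sketch; the values are placeholders):

```yaml
# Fragment of a pretraining config (placeholder values).
data:
  class_path: litgpt.data.TinyLlama
  init_args:
    data_path: data
    num_workers: 4
```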
FWIW, some data readers such as `TextFiles` ignore the `data.num_workers` argument (there is no reference to the `self.num_workers` set in the constructor: https://github.com/Lightning-AI/litgpt/blob/2f2ea8ca44d1fe41bdeb3a06f49da06930272984/litgpt/data/text_files.py#L69), so this might also be a source of confusion. I understand that `TextFiles` is not recommended for large datasets, but this behavior was still confusing to me with a moderate number of files (~200). Happy to send a PR that uses the passed-in `num_workers` (with the default being the current `num_cpus - 1`).
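The fix described above could look roughly like this. A sketch only, not the actual `TextFiles` code; the class and method names are illustrative:

```python
import os
from typing import Optional


class TextFilesSketch:
    """Illustrative stand-in for litgpt.data.TextFiles, showing how a
    user-supplied num_workers could be honoured instead of ignored."""

    def __init__(self, num_workers: Optional[int] = None):
        # None keeps the current behaviour: CPU count minus one.
        self.num_workers = num_workers

    def _effective_workers(self, num_files: int) -> int:
        default = max(1, (os.cpu_count() or 1) - 1)
        workers = self.num_workers if self.num_workers is not None else default
        # Never start more workers than there are files to process.
        return max(1, min(workers, num_files))
```

Capping the count at the number of files would also avoid spawning idle workers for small datasets.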
> @ebektas Does the issue occur when you use `litgpt pretrain ...` or is this an issue you encounter when preparing the dataset, e.g., `python litgpt/data/prepare_slimpajama.py ...`
It is when using `litgpt pretrain ...`, and the way I manipulated the number of workers was to change the number of files.
The line @fdalvi linked above feels a bit presumptuous to me and causes problems on shared machines, highly virtualized CPU servers, etc. I would rather have exact control over the worker count (and possibly keep the default behaviour if the user hasn't specified anything).
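One reason auto-detection misbehaves on shared machines is that `os.cpu_count()` reports all installed CPUs, not the CPUs the process is actually allowed to run on. A hedged sketch of a more conservative probe (`sched_getaffinity` is Linux-only, hence the fallback):

```python
import os


def usable_cpus() -> int:
    """Number of CPUs the current process may actually run on.

    On Linux, sched_getaffinity(0) respects taskset/cpuset pinning,
    which plain os.cpu_count() does not; fall back elsewhere.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1
```

Even this does not see cgroup CPU *quotas*, so an explicit user-supplied worker count remains the only fully reliable option.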
I still have a problem with a higher number of workers, which I might carry over to the litdata GitHub: the binarization process seems to hang after a certain number of batches, so I had to keep num_workers (i.e., the number of files) low. Ultimately I worked around this by creating my own script for binarizing and prebatching based on the data/prepare... scripts; a generic example for non-specific datasets, or for all the supported input types, would be useful.
Could we pass the number of litdata workers through pretraining for the data preprocessing part? Or is there a way to pass it using a config?
With over 100 files of ~1 GB each, the script starts 100 workers, but eventually some workers finish and the rest hang. When I use fewer files to lower the number of workers, it seems to work.
model: llama-2-7b
data type: TextFiles