Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Could we pass number of litdata workers in litgpt pretrain? #1500

Open ebektas opened 1 month ago

ebektas commented 1 month ago

Could we pass the number of litdata workers through litgpt pretrain for the data preprocessing part? Or is there a way to pass it using a config?

With over 100 files that are 1 GB each, the script starts 100 workers, but eventually some workers finish and the rest hang. When I use fewer files to lower the number of workers, it seems to work.

model: llama-2-7b
data type: TextFiles

ebektas commented 1 month ago

I now suspect there are some parameters you can pass with --data.*, but pretrain -h doesn't show them.

rasbt commented 1 month ago

I think that's currently not possible, but I'm in favor of adding it as a dataset config, maybe with the default num_workers="auto". Since you have much more experience with the pretraining code, @awaelchli, what do you think? I'd be happy to add that.
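Just to sketch what I mean (a rough illustration, not actual litgpt code; the helper name is made up), "auto" could resolve to the current machine-derived default while an integer gives exact control:

import os
from typing import Union


def resolve_num_workers(num_workers: Union[int, str] = "auto") -> int:
    # "auto" keeps a machine-derived default (all CPUs minus one); an int is used as-is
    if num_workers == "auto":
        return max(1, (os.cpu_count() or 2) - 1)
    return int(num_workers)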

rasbt commented 1 month ago

@ebektas Does the issue occur when you use litgpt pretrain ... or is this an issue you encounter when preparing the dataset, e.g., python litgpt/data/prepare_slimpajama.py ...

rasbt commented 1 month ago

I am mainly asking because it looks like we already expose the number of workers for the pretraining itself:

⚡ main ~/litgpt litgpt pretrain --data.help litgpt.data.TinyLlama
usage: litgpt [--data.init_args.data_path DATA_PATH] [--data.init_args.seed SEED] [--data.init_args.num_workers NUM_WORKERS]
              [--data.init_args.use_starcoder {true,false}]

The TinyLlama data module is composed of a mix of SlimPajama and Starcoder data:
  --data.init_args.data_path DATA_PATH
                        The path to the data directory, containing two folders 'slimpajama' and 'starcoder' which are the output of
                        the preprocessing step done in advance. See the `tutorial/pretrain_tinyllama.md` for instructions. The path
                        can also be a remote path (e.g., s3://). (type: Union[str, Path], default: data)
  --data.init_args.seed SEED
                        The random seed for shuffling the dataset. (type: int, default: 42)
  --data.init_args.num_workers NUM_WORKERS
                        How many DataLoader processes to use for loading. (type: int, default: 8)
  --data.init_args.use_starcoder {true,false}
                        Toggle for using Starcoder data. (type: bool, default: True)
fdalvi commented 1 month ago

FWIW, some data readers such as TextFiles ignore the data.num_workers argument (the self.num_workers set in the constructor is never referenced: https://github.com/Lightning-AI/litgpt/blob/2f2ea8ca44d1fe41bdeb3a06f49da06930272984/litgpt/data/text_files.py#L69), so this might also be a source of confusion. I understand that TextFiles is not recommended for large datasets, but this behavior was still confusing to me with a moderate number of files (~200). Happy to send a PR that uses the passed-in num_workers (with the default being the current num_cpus - 1).
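Roughly, the change I have in mind would look like this (just a sketch with a stand-in class, not the actual contents of text_files.py):

import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextFilesSketch:
    # stand-in for the relevant slice of litgpt.data.TextFiles, not the real class
    num_workers: Optional[int] = None

    def prepare_data(self) -> None:
        # prefer an explicitly passed value; fall back to the current default of cpu_count - 1
        num_workers = self.num_workers if self.num_workers is not None else (os.cpu_count() - 1)
        # the real code would hand `num_workers` to the litdata preprocessing call here
        print(f"tokenizing with {num_workers} workers")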

ebektas commented 1 month ago

@ebektas Does the issue occur when you use litgpt pretrain ... or is this an issue you encounter when preparing the dataset, e.g., python litgpt/data/prepare_slimpajama.py ...

It is when using "litgpt pretrain ...", and the way I manipulated the number of workers was to change the number of files.

The line just above the one @fdalvi linked is:

https://github.com/Lightning-AI/litgpt/blob/2f2ea8ca44d1fe41bdeb3a06f49da06930272984/litgpt/data/text_files.py#L68

I feel like it is a bit presumptuous and causes problems on shared machines, servers with many virtual CPUs, etc. I would rather have exact control over the number of workers (and possibly keep the default behaviour if the user hasn't specified anything).

I still have a problem with a higher number of workers, which I might carry over to the litdata GitHub: the binarization process seems to hang after a certain number of batches, so I had to keep num_workers (the number of files) low. Ultimately, the way I got around this was to write my own script for binarizing and prebatching, based on the data/prepare... scripts; a generic example of this, for non-specific datasets or for all of the supported data types, would be useful.
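For reference, my standalone script is roughly along these lines (a sketch on top of litdata's optimize API; the paths, tokenizer checkpoint, and chunk size are placeholders, not litgpt defaults):

from functools import partial
from pathlib import Path

from litdata import optimize              # litdata's public preprocessing entry point
from litgpt.tokenizer import Tokenizer    # litgpt's tokenizer wrapper


def tokenize_fn(filepath: str, tokenizer: Tokenizer):
    # read one raw text file and yield its token ids; adjust to your file format
    text = Path(filepath).read_text(encoding="utf-8", errors="ignore")
    yield tokenizer.encode(text, eos=True)


if __name__ == "__main__":
    input_files = sorted(str(p) for p in Path("data/raw").glob("*.txt"))  # placeholder input dir
    tokenizer = Tokenizer(Path("checkpoints/meta-llama/Llama-2-7b-hf"))   # placeholder checkpoint dir

    optimize(
        fn=partial(tokenize_fn, tokenizer=tokenizer),
        inputs=input_files,
        output_dir="data/binarized",  # placeholder output dir
        chunk_bytes="500MB",          # placeholder chunk size
        num_workers=8,                # explicit worker count instead of a cpu_count-derived one
    )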