Also, the data preparation seems to take a long time. Are you planning to provide the pre-processed MiniPile dataset?
Hello, @ZLKong
We recommend splitting the dataset into 4 splits when using 4 GPUs. Alternatively, you could split the dataset into 8 splits and run the script twice. Since the processed MiniPile dataset is quite large (~600GB), you could also sample a small subset and run the script on that.
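If it helps, here is a minimal sketch of the sampling-and-splitting step using the Hugging Face datasets library. The "JeanKaddour/minipile" dataset id, the subset size, and the output paths are illustrative assumptions, not values from this repo:

```python
from datasets import load_dataset

# Load MiniPile; "JeanKaddour/minipile" is one public copy on the
# Hugging Face Hub (an assumption about which copy you use).
dataset = load_dataset("JeanKaddour/minipile", split="train")

# Sample a small subset so the representation step finishes quickly.
subset = dataset.shuffle(seed=42).select(range(10_000))

# Write one shard per GPU (e.g., 4 GPUs -> 4 splits).
num_splits = 4
for i in range(num_splits):
    shard = subset.shard(num_shards=num_splits, index=i)
    shard.save_to_disk(f"./minipile_subset/split_{i}")
```

Each saved split can then be fed to the representation script on its own GPU.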
I see, thanks!
I have a general question. I am now hitting a "CUDA out of memory" error when training with 8x 24GB 3090 GPUs. Is there a way to reduce the memory usage on each GPU? I see that the batch size is already set to 1 in your script below:
```sh
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --master_port=20001 ./src/train.py \
  --training_mode full \
  --deepspeed ./config/zero_stage2_config.json \
  --model_name_or_path "<path_to_llama_2_7b>" \
  --output_dir "<path_to_save_fusellm_7b>" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  ...
```
@ZLKong You can switch the DeepSpeed configuration to ZeRO-2 Offload or ZeRO-3, but the best solution would be to use LoRA or QLoRA. A sample offload configuration is sketched below.
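For reference, a minimal ZeRO-2 Offload configuration might look like the following. This is an illustrative sketch, not the repo's actual ./config/zero_stage2_config.json; the "auto" values assume the Hugging Face Trainer integration fills them in from the command-line arguments:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

Offloading the optimizer state to CPU trades step time for GPU memory; ZeRO-3 additionally partitions the parameters themselves, which tends to help most on 24GB cards.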
Thanks for your advice!
In the second step:
Get representations for each LLM: We split the dataset into 8 splits, then process each split on a GPU.
Does the number of splits depend on the number of GPUs we are running? For example, if we have only 4 GPUs, should I split the dataset into 4 splits?
Thanks
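A minimal per-GPU launcher for the split-processing step described above might look like the sketch below. The ./src/get_representations.py entry point and its --dataset_split flag are hypothetical placeholders, not the repo's actual script; substitute the real representation script and arguments:

```python
import os
import subprocess

# Launch one representation job per GPU, one dataset split each.
# NOTE: "./src/get_representations.py" and "--dataset_split" are
# hypothetical placeholders for the repo's actual script and flags.
num_gpus = 4
procs = []
for i in range(num_gpus):
    # Pin each process to a single GPU via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i))
    procs.append(subprocess.Popen(
        ["python", "./src/get_representations.py",
         "--dataset_split", f"./minipile_splits/split_{i}"],
        env=env,
    ))

# Wait for all splits to finish before merging representations.
for p in procs:
    p.wait()
```

Because each process only sees one GPU, the splits run in parallel without competing for the same device's memory.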