
FuseAI Project
https://huggingface.co/FuseAI

Regarding MiniPile dataset splitting #5

Closed: ZLKong closed this issue 8 months ago

ZLKong commented 9 months ago

In the second step:

"Get representations for each LLM: We split the dataset into 8 splits, then process each split on a GPU."

Does the number of splits depend on the number of GPUs we are running?

For example, if we have only 4 GPUs, should I split it into 4 splits?

Thanks

ZLKong commented 9 months ago

Also, the data preparation seems to take a long time. Are you planning to provide the pre-processed MiniPile dataset?

18907305772 commented 9 months ago

Hello, @ZLKong

It is recommended to split the dataset into 4 splits when using 4 GPUs. Alternatively, you could split the dataset into 8 splits and run the script twice. Since the processed MiniPile dataset is too large to release (~600 GB), you could sample a small subset and run the script on that instead.
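
As a rough sketch of what this could look like with the Hugging Face `datasets` library (the `JeanKaddour/minipile` dataset id, the `SPLIT_ID` environment variable, and the subset size below are illustrative assumptions, not part of the FuseAI scripts):

```python
# Minimal sketch: shard MiniPile so each GPU worker processes one split.
# The dataset id and the SPLIT_ID environment variable are illustrative,
# not taken from the FuseAI repository.
import os
from datasets import load_dataset

NUM_SPLITS = 4                          # match the number of available GPUs
split_id = int(os.environ["SPLIT_ID"])  # 0..NUM_SPLITS-1, one process per GPU

dataset = load_dataset("JeanKaddour/minipile", split="train")

# Optional: work on a small subset first, since the full processed data is ~600 GB.
# dataset = dataset.select(range(10_000))

shard = dataset.shard(num_shards=NUM_SPLITS, index=split_id)
# ... pass `shard` to the representation-extraction step for this GPU ...
```

With 8 pre-made splits on a 4-GPU machine, you would run the same command twice, covering splits 0-3 in the first pass and 4-7 in the second.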

ZLKong commented 9 months ago

I see, thanks!

ZLKong commented 8 months ago

> It is recommended to split the dataset into 4 splits when using 4 GPUs. Alternatively, you could split the dataset into 8 splits and run the script twice. Since the processed MiniPile dataset is too large to release (~600 GB), you could sample a small subset and run the script on that instead.

I have a general question: I am now getting a "CUDA out of memory" error when running on 8x 24 GB 3090 GPUs. Is there a way to reduce the memory usage per GPU? I see that the batch size has already been set to 1 in your script below:


export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

deepspeed --master_port=20001 ./src/train.py \
  --training_mode full \
  --deepspeed ./config/zero_stage2_config.json \
  --model_name_or_path "<path_to_llama_2_7b>" \
  --output_dir "<path_to_save_fusellm_7b>" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
...
18907305772 commented 8 months ago

@ZLKong You can change the DeepSpeed configuration to ZeRO-2 with optimizer offload or to ZeRO-3, but the best solution would be to use LoRA or QLoRA.
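
For reference, a minimal sketch of the LoRA/QLoRA route using `peft` and `transformers` (the rank, target modules, and 4-bit settings below are illustrative assumptions, not values taken from the FuseLLM training script):

```python
# Minimal sketch: load Llama-2-7B in 4-bit (QLoRA) and attach LoRA adapters
# so fine-tuning fits on 24 GB GPUs. All hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "<path_to_llama_2_7b>",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```

If you prefer to keep full fine-tuning, switching the `--deepspeed` config from plain ZeRO-2 to ZeRO-2 with optimizer offload, or to ZeRO-3 with parameter offload, moves optimizer states and/or parameters to CPU memory at the cost of slower training steps.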

ZLKong commented 8 months ago

Thanks for your advice!