dvlab-research / LongLoRA

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
http://arxiv.org/abs/2309.12307
Apache License 2.0

Error in FineTuning #36

Closed yixuantt closed 11 months ago

yixuantt commented 11 months ago

I am facing an error while fine-tuning:

    raise ValueError("q_len %d should be divisible by group size %d."%(q_len, group_size))
ValueError: q_len 583 should be divisible by group size 145.
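
For context, this check comes from the S2-Attn grouping: the sequence is folded into groups whose size is a fixed fraction of q_len (here int(583 / 4) = 145, i.e. a 1/4 ratio), and the fold only works when q_len is an exact multiple of the group size. Below is a minimal illustrative sketch of that grouping step, not the repository's exact code; the function name and tensor layout are made up for illustration.

    import torch

    def s2_attn_group_sketch(qkv: torch.Tensor, group_size_ratio: float = 1 / 4) -> torch.Tensor:
        """Illustrative sketch of the S2-Attn grouping step (not the repo's exact code).

        qkv: projected queries/keys/values of shape (bsz, q_len, 3, num_heads, head_dim).
        """
        bsz, q_len, three, num_heads, head_dim = qkv.shape
        group_size = int(q_len * group_size_ratio)  # e.g. int(583 * 1/4) = 145
        if q_len % group_size > 0:
            raise ValueError("q_len %d should be divisible by group size %d." % (q_len, group_size))

        # Shift the second half of the heads by half a group so adjacent groups
        # exchange information, then fold the sequence into groups.
        shifted = qkv.clone()
        shifted[:, :, :, num_heads // 2:] = shifted[:, :, :, num_heads // 2:].roll(-group_size // 2, dims=1)
        # This reshape is the step that requires q_len % group_size == 0.
        return shifted.reshape(bsz * q_len // group_size, group_size, three, num_heads, head_dim)

    # With q_len = 583, group_size = int(583 / 4) = 145 and 583 % 145 = 3,
    # which is exactly why the check above fires.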
yukang2017 commented 11 months ago

Hi,

Would you please show me the entire script you used?

Regards, Yukang Chen

yixuantt commented 11 months ago

Sure.

CMD="torchrun --nproc_per_node 7 --nnodes 1 \
    --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
    supervised-fine-tune.py \
    --model_name_or_path "meta-llama/Llama-2-70b-hf" \
    --bf16 True \
    --output_dir "" \
    --cache_dir ""\
    --model_max_length 8192 \
    --use_flash_attn True \
    --data_path "data/fin_merged.json" \
    --low_rank_training True \
    --num_train_epochs 10  \
    --per_device_train_batch_size 2     \
    --per_device_eval_batch_size 2     \
    --gradient_accumulation_steps 4     \
    --evaluation_strategy "no"     \
    --save_strategy "steps"     \
    --save_steps 1000     \
    --save_total_limit 2     \
    --learning_rate 2e-5     \
    --weight_decay 0.0     \
    --warmup_steps 20     \
    --lr_scheduler_type "constant_with_warmup"     \
    --logging_steps 1     \
    --deepspeed "ds_configs/stage3.json" \
    --tf32 True
    "
marizu9 commented 11 months ago

I am having the same issue. I am doing supervised fine-tuning on a merged Llama 2 7B 32k model, with the model type set to llama and the same parameters as in the SFT tutorial.

zhoukezi commented 11 months ago

I'm having the same issue, and I resolved it by padding inputs to a multiple of 4. The patched implementation appears to expect the input to be evenly divided into 4 groups.

marizu9 commented 11 months ago

@zhoukezi could you please share sample code? Thank you!

zhoukezi commented 11 months ago

@marizu9 I'm using the code here with my own training code, so I just call the tokenizer like this:

tokenized = tokenizer(text, padding=True, pad_to_multiple_of=4)

I'm not sure how to modify supervised-fine-tune.py, but it seems like the script only uses padding="longest" (which is equivalent to padding=True), so adding pad_to_multiple_of=4 might be helpful.

https://github.com/dvlab-research/LongLoRA/blob/291ba2c16f8ae36d687bf1c6b68db1ab577ead41/supervised-fine-tune.py#L121-L142
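
For anyone who wants to bake the workaround into supervised-fine-tune.py itself, here is a hedged sketch of the kind of change meant above. It assumes an Alpaca-style _tokenize_fn like the one the link points to; apart from pad_to_multiple_of=4, the arguments shown are typical rather than copied from the script, so adapt them to the actual code.

    from typing import Dict, Sequence

    import transformers

    def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
        """Tokenize each string, padding its length up to a multiple of 4 tokens."""
        tokenized_list = [
            tokenizer(
                text,
                return_tensors="pt",
                padding="longest",       # equivalent to padding=True
                pad_to_multiple_of=4,    # the only new argument: round lengths up to a multiple of 4
                max_length=tokenizer.model_max_length,
                truncation=True,
            )
            for text in strings
        ]
        input_ids = [tokenized.input_ids[0] for tokenized in tokenized_list]
        # The remaining bookkeeping (labels, lengths, etc.) would follow the original script.
        return dict(input_ids=input_ids)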

yukang2017 commented 11 months ago

Hi @zhoukezi @yixuantt @marizu9,

There is no need for padding, which would introduce redundant computation.

I have provided a new implementation of S2-Attn in https://github.com/dvlab-research/LongLoRA/blob/61d469085de348b53e480125ca6fb43f19dc0ed9/llama_attn_replace_sft.py#L24

Please git clone the repository again and try the new SFT code. I have tested it on a 7B model and it works well. Please tell me if there are any other issues.

Regards, Yukang Chen
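
In case it helps others wiring the patched attention into their own training loop, a hedged usage sketch follows. The helper name and keyword argument are taken from the linked llama_attn_replace_sft.py but should be double-checked against the file you cloned; the model name is just an example.

    import transformers

    # Assumed helper from the repository's llama_attn_replace_sft.py; verify the
    # exact function name and arguments in that file before relying on this.
    from llama_attn_replace_sft import replace_llama_attn

    replace_llama_attn(use_flash_attn=True)  # monkey-patch LlamaAttention with S2-Attn

    model = transformers.AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",  # example checkpoint, as in the script above
        torch_dtype="auto",
    )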

yukang2017 commented 11 months ago

I will close this issue as it has been inactive for several days. Please feel free to reopen it if there is anything else to discuss.

Klein73 commented 11 months ago

Hi, the GPT-NeoX code also runs into this problem. Would it be okay for me to fix it by following the LLaMA PR? Please also update the GPT-NeoX code when you get a chance.

yixuantt commented 11 months ago

@yukang2017 I tried the new SFT code. Here is another error:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1f588664d7 in /home//.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1f5883036b in /home//.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1f5890afa8 in /home//.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f1f597f2590 in /home//.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f1f597f5b68 in /home//.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7f1f597f70b7 in /home//.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd3e79 (0x7f1f9a6b4e79 in /home//.conda/envs/utorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7e65 (0x7f1fd615ee65 in /lib64/libpthread.so.0)
frame #8: clone + 0x6d (0x7f1fd577e88d in /lib64/libc.so.6)
bdytx5 commented 11 months ago

This is still an issue for GPT-NeoX.