Closed — yixuantt closed this issue 11 months ago
Hi,
Would you please show me the entire script you used?
Regards, Yukang Chen
Sure.
CMD="torchrun --nproc_per_node 7 --nnodes 1 \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT \
supervised-fine-tune.py \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--bf16 True \
--output_dir "" \
--cache_dir ""\
--model_max_length 8192 \
--use_flash_attn True \
--data_path "data/fin_merged.json" \
--low_rank_training True \
--num_train_epochs 10 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 2 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--warmup_steps 20 \
--lr_scheduler_type "constant_with_warmup" \
--logging_steps 1 \
--deepspeed "ds_configs/stage3.json" \
--tf32 True
"
I am having the same issue. I am doing supervised fine-tuning on a merged Llama 2 7B 32k model, with model type set to `llama` and the same parameters as the SFT tutorial.
I'm having the same issue, and I resolved it by padding inputs to a multiple of 4. The patched implementation appears to expect the input to be evenly divisible into 4 groups.
@zhoukezi could you please share sample code? Thank you!
@marizu9 I'm using the code here with my own training code, so I just call the tokenizer like this:

```python
tokenized = tokenizer(text, padding=True, pad_to_multiple_of=4)
```
I'm not sure how to modify `supervised-fine-tune.py`, but it seems like the script only uses `padding="longest"` (which is equivalent to `padding=True`), so adding `pad_to_multiple_of=4` might be helpful.
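For anyone unsure what `pad_to_multiple_of=4` actually does to the token ids, here is a minimal pure-Python sketch of the same idea: extend each sequence with the pad id until its length divides evenly by 4, so the attention patch can split it into 4 groups. The names `pad_to_multiple` and `PAD_ID` are illustrative, not from the repo.

```python
# Illustrative pad id; real code should use tokenizer.pad_token_id.
PAD_ID = 0

def pad_to_multiple(ids, multiple=4, pad_id=PAD_ID):
    """Append pad tokens until len(ids) is a multiple of `multiple`."""
    remainder = len(ids) % multiple
    if remainder == 0:
        return list(ids)
    return list(ids) + [pad_id] * (multiple - remainder)

print(pad_to_multiple([101, 7592, 102]))        # length 3 -> length 4
print(pad_to_multiple([101, 1, 2, 3, 4, 102]))  # length 6 -> length 8
```

The `pad_to_multiple_of` argument of Hugging Face tokenizers performs this same extension (and pads the attention mask accordingly).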
Hi @zhoukezi @yixuantt @marizu9,
There is no need for padding, which would introduce redundant computation.
I have provided a new implementation of S2-Attn in https://github.com/dvlab-research/LongLoRA/blob/61d469085de348b53e480125ca6fb43f19dc0ed9/llama_attn_replace_sft.py#L24
Please `git clone` again and try this new SFT code. I have tested it on a 7B model and it works well. Please tell me if there are any other issues.
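For context on why the group count matters at all: in S2-Attn (shifted sparse attention), tokens attend within local groups, and half the heads use groups shifted by half the group size so information can flow between neighboring groups. A rough pure-Python sketch of the grouping with token indices (not the repo's actual tensor code):

```python
def make_groups(seq_len, group_size, shifted=False):
    """Split token indices 0..seq_len-1 into attention groups.

    With shifted=True, the sequence is rolled by half a group first,
    mimicking the shifted heads in S2-Attn.
    """
    tokens = list(range(seq_len))
    if shifted:
        s = group_size // 2
        tokens = tokens[s:] + tokens[:s]
    return [tokens[i:i + group_size] for i in range(0, seq_len, group_size)]

plain = make_groups(8, 4)         # [[0, 1, 2, 3], [4, 5, 6, 7]]
shift = make_groups(8, 4, True)   # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

The sketch makes it clear why a sequence length that does not divide evenly into groups breaks the split, which is what the padding workaround above was compensating for.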
Regards, Yukang Chen
I will close this issue as it has been inactive for several days. Please feel free to reopen it if there is anything else to discuss.
Hi, the gptneox code also runs into this problem. Is it OK if I modify it following the Llama PR? Please also update the gptneox code when you can, thanks.
@yukang2017 I tried new sft code. Here is another error.
```
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1f588664d7 in /home//.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1f5883036b in /home//.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1f5890afa8 in /home//.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f1f597f2590 in /home//.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f1f597f5b68 in /home//.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7f1f597f70b7 in /home//.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd3e79 (0x7f1f9a6b4e79 in /home//.conda/envs/utorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x7e65 (0x7f1fd615ee65 in /lib64/libpthread.so.0)
frame #8: clone + 0x6d (0x7f1fd577e88d in /lib64/libc.so.6)
```
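As the error message itself suggests, the NCCL stack trace above is likely not where the illegal access actually happened. One way to localize it, as a debugging sketch (the full `torchrun` arguments are the same as in the script earlier and are elided here):

```shell
# Force synchronous CUDA kernel launches so the reported stack trace
# points at the kernel that actually faulted. This slows training down
# considerably, so use it only while debugging.
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node 7 --nnodes 1 supervised-fine-tune.py ...
```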
This is still an issue for GPT-NeoX.
I am facing an error while fine-tuning: