Hi, thanks for using EasyDeL! Sure, here are some tips (a rough config sketch follows below):
- Lower max_length.
- Use flash_attention.
- Change the LoRA target modules from ["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "v_proj", "k_proj"] to ["q_proj", "o_proj", "v_proj", "k_proj"].
Thank you for your suggestions. I found that sequence sharding allows longer contexts, so I want to try sequence sharding with flash attention to maximize the sequence length.
I followed your instructions in the README, but I got ValueError: Attention bias shape mismatch: expected (batch_size=1, num_heads=8, q_seq_len=1024, kv_seq_len=1024), got (1, 8, 8192, 8192).
```python
# use these partition specs when not using custom sharding_axis_names
# and using sequence sharding with flash attention
query_partition_spec=PartitionSpec(("dp", "fsdp"), None, "sp", "tp"),
generation_query_partition_spec=PartitionSpec(("dp", "fsdp"), None, None, "tp"),
key_partition_spec=PartitionSpec(("dp", "fsdp"), None, "sp", "tp"),
value_partition_spec=PartitionSpec(("dp", "fsdp"), None, "sp", "tp"),
attention_partition_spec=PartitionSpec(("dp", "fsdp"), None, "sp", "tp"),
```
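For reference, a minimal plain-JAX sketch (not EasyDeL internals) of the mesh these specs imply, assuming all eight cores of a Kaggle v3-8 sit on the sp axis; it shows where the per-device q_seq_len of 1024 in the error comes from:

```python
import numpy as np
import jax
from jax.sharding import Mesh

# Assumption: a Kaggle TPU v3-8 (8 cores) with the whole device count on "sp".
devices = np.array(jax.devices()).reshape(1, 1, 8, 1)
mesh = Mesh(devices, axis_names=("dp", "fsdp", "sp", "tp"))

seq_len = 8192
per_device_seq = seq_len // mesh.shape["sp"]  # 8192 / 8 = 1024 tokens per core
print(dict(mesh.shape), per_device_seq)
```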
So what is the correct partition strategy for using sequence sharding with flash attention?
Flash attention works with the FSDP sharding, which means you should use a batch size of at least 8.
Yes, I understand. But can flash attention work with sequence sharding in EasyDeL?
I tried setting the batch size to 8 or more, but it still says ValueError: Attention bias shape mismatch: expected (batch_size=1, num_heads=8, q_seq_len=1024, kv_seq_len=1024), got (1, 8, 8192, 8192).
When I set bias_partition_spec=PartitionSpec(("dp", "fsdp"), None, "sp", None), it says ValueError: Attention bias shape mismatch: expected (batch_size=1, num_heads=8, q_seq_len=1024, kv_seq_len=1024), got (1, 8, 1024, 8192).
You have to set the sharding array axes to 1, -1, 1, 1 for that, and I'll add a way (or re-create the algorithms) to make flash attention work with the sequence sharding method. It's already possible, just not on Kaggle: you simply have to set tensor parallel and sequence parallel to the same number, e.g. 1, 1, 4, 4.
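A quick sketch of the device arithmetic behind the two layouts mentioned above, assuming the axis order is (dp, fsdp, tp, sp) and that -1 means "use all remaining devices" (both are assumptions about the convention, not confirmed here):

```python
import math

# Assumed axis order: (dp, fsdp, tp, sp); -1 is assumed to absorb all remaining devices.
kaggle_tpu_cores = 8

fsdp_layout = (1, -1, 1, 1)   # FSDP spans all 8 cores, so the batch must be a multiple of 8
tp_sp_layout = (1, 1, 4, 4)   # tensor parallel = 4 and sequence parallel = 4

# The product of the fixed axes has to fit the device count:
print(math.prod(tp_sp_layout))                      # 16 -> needs 16 devices
print(math.prod(tp_sp_layout) <= kaggle_tpu_cores)  # False -> doesn't fit a v3-8
```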
Thank you for your kind explanation. Will it be possible on Kaggle in the near future?
Yes, it will be possible soon.
And actually, the attention mechanism has been improved and is now faster and more efficient; you can try again by changing the sharding array axes to 1, -1, 1, 1.
@erfanzar Thank you! I'm looking forward to it. I found that training speed increased by about 25% for the Gemma model when using normal attention now. Amazing.
Describe the bug
Hi, I really appreciate your continued commitment to this project and to making it better and better. I'm one of the people who benefit greatly from it. Thank you.
Now I am trying to fine-tune the Yi-34B-Chat model on Kaggle's TPU but am running into out-of-memory errors.
From my own experience, fine-tuning with 16-bit precision using transformers, QLoRA, and Flash Attention 2 on an A100 40G GPU consumes about 33G of VRAM.
Although a TPU VM v3-8 offers 128G of RAM, I'm unable to complete the fine-tuning process due to memory constraints.
Previously, I fine-tuned the Yi-34B-Chat model extensively on an A100 40G, but I no longer have access to that machine. Now I need to continue fine-tuning on a TPU v3-8 with the same LoRA parameters and sequence length.
Is it possible? I would greatly appreciate some tips on reducing memory usage to fit the constraints of the TPU.
To Reproduce