epfml / landmark-attention

Landmark Attention: Random-Access Infinite Context Length for Transformers
https://arxiv.org/abs/2305.16300
Apache License 2.0
414 stars 36 forks

Will this work with Qlora and 4bit inference? #1

Open jordancole21 opened 1 year ago

jordancole21 commented 1 year ago

Super excited about this project! I'm in the process of reading the paper now! But just curious, are there any plans to make this work for 4bit or 8bit finetuning, so it can be applied to the larger opensource models?

martinjaggi commented 1 year ago

Yes, currently it relies on standard finetuning to transform an existing model to longer context. Conceptually, it should be perfectly reasonable to try LoRA finetuning instead for more memory efficiency, or even QLoRA.

We're playing with LoRA at the moment. One change you'll have to make is to also unfreeze the initial embedding layer, to allow it to learn the new landmark tokens. Otherwise, it should work out of the box.
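
For concreteness, here is a minimal sketch of what that could look like with the Hugging Face peft library; the token name, base model, and LoRA hyperparameters below are illustrative, not this repo's actual settings:

```python
# Rough sketch (not from this repo): LoRA finetuning with the input embedding
# layer kept trainable so a newly added landmark token can be learned.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"  # example base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add the landmark token (name is illustrative) and grow the embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": ["<landmark>"]})
model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens"],  # keep the (resized) embeddings trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```

Here modules_to_save keeps a full trainable copy of the embedding layer alongside the LoRA adapters, which is one way to satisfy the "unfreeze the embedding layer" requirement; depending on memory you may also want lm_head in that list.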

Let us know if you manage to get it running by combining the two codebases. We can keep the issue open so other people can share their experience as well.

juanps90 commented 1 year ago

Do you think it should be doable with 4 bit finetuning as implemented on this repo?

https://github.com/johnsmith0031/alpaca_lora_4bit

How big of a dataset would be necessary for the finetuning to work past the 2048 token mark, assuming a LoRA approach?

Alignment-Lab-AI commented 1 year ago

> Do you think it should be doable with 4 bit finetuning as implemented on this repo?
>
> https://github.com/johnsmith0031/alpaca_lora_4bit
>
> How big of a dataset would be necessary for the finetuning to work past the 2048 token mark, assuming a LoRA approach?

https://discord.gg/n9hXaBPWxx

On my server I have a system set up, and this is one of the active problems we're working on. I plan on figuring out how to alter this to pass custom datasets through its tokenization system and run it with axolotl, which has built-in QLoRA functionality.

It's actually a very cool development stack that involves Samantha, a new model developed by ehartford. The server has some of the more notable open-source researchers, and if you think you can figure out how to combine the requisite elements, we're paying for compute to run experiments and make attempts, including H100s or clusters of A100s if needed.

Honestly, I'd happily pay for you to train a 65B LLaMA finetune if you can figure out how to do it with custom datasets on this repo.

juanps90 commented 1 year ago

> Do you think it should be doable with 4 bit finetuning as implemented on this repo? https://github.com/johnsmith0031/alpaca_lora_4bit How big of a dataset would be necessary for the finetuning to work past the 2048 token mark, assuming a LoRA approach?
>
> https://discord.gg/n9hXaBPWxx
>
> On my server I have a system set up, and this is one of the active problems we're working on. I plan on figuring out how to alter this to pass custom datasets through its tokenization system and run it with axolotl, which has built-in QLoRA functionality.
>
> It's actually a very cool development stack that involves Samantha, a new model developed by ehartman. The server has some of the more notable open-source researchers, and if you think you can figure out how to combine the requisite elements, we're paying for compute to run experiments and make attempts, including H100s or clusters of A100s if needed.
>
> Honestly, I'd happily pay for you to train a 65B LLaMA finetune if you can figure out how to do it with custom datasets on this repo.

Thank you very much. I don't feel qualified enough to take on such a challenge, maybe a more knowledgeable person can do this.

I joined the discord you sent and will be paying attention to any developments around increased context length on LLaMA models.

eugenepentland commented 1 year ago

I am currently training a 7B TheBloke-WizardLM-7B-HF model. I worked around all of the issues to get it up and running. It's presently generating the tokens for the original dataset. If all goes well and I'm confident in the results, I'll send you a message about seeing if I can work with you to get a 65B model.

Alignment-Lab-AI commented 1 year ago

> I am currently training a 7B TheBloke-WizardLM-7B-HF model. I worked around all of the issues to get it up and running. It's presently generating the tokens for the original dataset. If all goes well and I'm confident in the results, I'll send you a message about seeing if I can work with you to get a 65B model.

keep me posted!

eugenepentland commented 1 year ago

Just wanted to give a status update. I've been able to create a QLoRA but haven't seen any improvement as of yet. There were a lot of settings I had to tweak to get it running and I'm not sure if the issue is settings I've tweaked or just from not training long enough.

I will release a fork later today with what I've done and an explanation of the issues I'm running into, if anyone else wants to take a crack at it as well.

Alignment-Lab-AI commented 1 year ago

> Just wanted to give a status update. I've been able to create a QLoRA but haven't seen any improvement as of yet. There were a lot of settings I had to tweak to get it running and I'm not sure if the issue is settings I've tweaked or just from not training long enough.
>
> I will release a fork later today with what I've done and an explanation of the issues I'm running into, if anyone else wants to take a crack at it as well.

I'll happily take a look; maybe we can crack it if we work at it as a unit. How was the model's behavior with just the landmark?

windprak commented 1 year ago

I could train 65B on 4 nodes (32 GPUs) if someone wants to help me with the training script.

eugenepentland commented 1 year ago

I worked with @Alignment-Lab-AI and we were able to reproduce the results as described in the paper. Here is a guide if you just want to be able to test out landmark attention (not qlora yet, that's still in the works)

This guide is for spinning up the model on Lambda Labs, so you might be able to skip some of the steps; I'm leaving all of the information in here just in case. This was done at 3 AM yesterday, so there may be an issue you run into when setting up that I forgot about. If you have any issues, let me know.

Python 3.11
CUDA 12
H100 GPU (80GB VRAM)
Ubuntu

Getting the models

mkdir models
cd models

When cloning the repos, you may need to figure out how to get git lfs install working. I had to do something but I don't remember what I did.

git clone https://huggingface.co/epfml/landmark-attention-llama7b-wdiff
git clone https://huggingface.co/huggyllama/llama-7b

Installing the required libraries

(If you don't have conda:)
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
chmod -v +x Miniconda*.sh
./Miniconda3-py39_4.12.0-Linux-x86_64.sh

conda create -n landmark
conda activate landmark
conda install pytorch -c pytorch
cd landmark-attention/
pip install -r requirements.txt
cd llama/
pip install -r requirements.txt
pip install sentencepiece

Creating the tuned llama file. This is how the merged model is created

cd /landmark-attention/llama
python weight_diff.py recover \
    --path_diff /home/ubuntu/models/landmark-attention-llama7b-wdiff/ \
    --path_raw /home/ubuntu/models/llama-7b \
    --path_tuned /home/ubuntu/models/llama-tuned

Running inference (run_test.py): if you want to run it on your own prompt, the easiest way is to comment out the llama_base_pipe and change the generate_prompt function. You need to use their llama_mem_pipe for inference, so you aren't going to be able to take the model and have it function as expected in oobabooga or something similar as of yet.

cd /landmark-attention/llama
python run_test.py

We were able to get context up to 25k working and get the correct answer, but it is SLOW. I think there are some optimizations that still need to be made to improve the performance. More comprehensive testing for evaluation needs to be completed as well.

windprak commented 1 year ago

Yes, I already did the 7B reproduction with 8 A100 80GBs, but I want to also try larger models up to 65B running on 4 nodes. I guess I have to use DeepSpeed as shown in Llama-X. Is anybody up for some collaboration?

eugenepentland commented 1 year ago

You can join our discord and we can work out the details. We are trying to get this in the hands of people as soon as possible.

https://discord.gg/fPUqTUWG

Alignment-Lab-AI commented 1 year ago

> Yes, I already did the 7B reproduction with 8 A100 80GBs, but I want to also try larger models up to 65B running on 4 nodes. I guess I have to use DeepSpeed as shown in Llama-X. Is anybody up for some collaboration?

There's a lot of optimization left to do. You can reach us at toast's Discord or at mine:

https://discord.gg/X7E34Am9sn

eugenepentland commented 1 year ago

Just giving a status update: we've been able to train a 3B model using 20GB of VRAM and a 7B model using 29GB of VRAM, but we have not trained long enough to get results and are still working out how it will get merged with the original weights.

I've made a fork that will keep up with my progress on this. https://github.com/eugenepentland/landmark-attention-qlora
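
On the merging question, a hedged sketch of the usual peft approach (paths are placeholders, and this assumes the adapter is loaded onto a full-precision copy of the base model rather than the 4-bit one):

```python
# Sketch: fold a trained (Q)LoRA adapter back into the base weights so the
# result can be used like an ordinary checkpoint. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",      # the base model the adapter was trained from
    torch_dtype=torch.float16,  # merge in fp16/bf16, not in 4-bit
)
# If the vocab was extended for the landmark token, resize before loading:
# base.resize_token_embeddings(new_vocab_size)

model = PeftModel.from_pretrained(base, "path/to/landmark-qlora-adapter")
model = model.merge_and_unload()  # applies the LoRA deltas to the base weights
model.save_pretrained("path/to/merged-model")
```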

eugenepentland commented 1 year ago

Hey everyone, I got it working! We are running a longer training overnight, but from training 500 steps on the 7B model, we were getting up to 7k tokens (I tried 32k but got OOM). The accuracy wasn't great (60% accurate at 7k tokens) from lack of training so we haven't released the model yet. We should have something released tomorrow though.

Exciting stuff!

mkrima commented 1 year ago

Hi.

Thank you for following up on this. Looking forward to hear more about the final accuracy.

Regarding OOM, are you using offloading?

juanps90 commented 1 year ago

> Hey everyone, I got it working! We are running a longer training overnight, but from training 500 steps on the 7B model, we were getting up to 7k tokens (I tried 32k but got OOM). The accuracy wasn't great (60% accurate at 7k tokens) from lack of training so we haven't released the model yet. We should have something released tomorrow though.
>
> Exciting stuff!

Very exciting indeed! Is this a full finetuning or QLoRA?

eugenepentland commented 1 year ago

QLoRA. We weren't using offloading, so that's probably why we were getting OOM. We haven't had much luck getting better performance than the initial 60% accuracy. We trained a 13B, 7B, and 3B, and all of them had the same issue. We are still in the process of tweaking all of the LoRA settings to see if we can improve the results. The goal was to make it so larger models, i.e. 30B or 65B, could be trained as well at a low cost. For doing all 3 of those trainings today, we've only spent about $50 on an H100.

mkrima commented 1 year ago

Thanks for the update. I took a quick pass at your code and it seems the embedding layer is frozen during training. This can be a problem since we are adding a new token and at least this token's embedding needs to be trained.

eugenepentland commented 1 year ago

Would you be able to provide some example code or another repo on how that is done? I'll take a look myself but this is all still very new to me. I'm just a guy with some free time and used to staring at things until I can get them working.

mkrima commented 1 year ago

I don't have an example unfortunately (but if someone else does, please share). I also have not done this before myself. But, as a guess, I think you should be able to achieve this by updating the find_all_linear_names function in your train_qlora.py and adding embed_tokens in LlamaModel to it (you need to find the right name for it).
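
For reference, one hedged guess at what that edit could look like in a qlora-style script (the function below is a sketch of that helper, not code from the actual fork):

```python
# Guessed sketch of a qlora-style helper, extended so peft also adapts the
# input embedding layer (letting the new landmark token's embedding train).
import bitsandbytes as bnb

def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            lora_module_names.add(name.split(".")[-1])
    lora_module_names.discard("lm_head")   # usually excluded from LoRA targets
    lora_module_names.add("embed_tokens")  # LlamaModel's input embedding layer
    return list(lora_module_names)
```

peft can attach LoRA to nn.Embedding modules, so adding embed_tokens to the target modules is one option; alternatively, listing it under modules_to_save in the LoraConfig trains a full copy of the embedding matrix instead.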

eugenepentland commented 1 year ago

Okay thanks for the pointer, we'll see how it goes. Took me 3 days to get to this point so hopefully I'll have it figured out this weekend.

ehartford commented 1 year ago

> Do you think it should be doable with 4 bit finetuning as implemented on this repo? https://github.com/johnsmith0031/alpaca_lora_4bit How big of a dataset would be necessary for the finetuning to work past the 2048 token mark, assuming a LoRA approach?
>
> https://discord.gg/n9hXaBPWxx
>
> On my server I have a system set up, and this is one of the active problems we're working on. I plan on figuring out how to alter this to pass custom datasets through its tokenization system and run it with axolotl, which has built-in QLoRA functionality.
>
> It's actually a very cool development stack that involves Samantha, a new model developed by ehartman. The server has some of the more notable open-source researchers, and if you think you can figure out how to combine the requisite elements, we're paying for compute to run experiments and make attempts, including H100s or clusters of A100s if needed.
>
> Honestly, I'd happily pay for you to train a 65B LLaMA finetune if you can figure out how to do it with custom datasets on this repo.

hey man it's ehartford not ehartman :-) hit me up if you need help.

Alignment-Lab-AI commented 1 year ago

> hey man it's ehartford not ehartman :-) hit me up if you need help.

Whoops! Sorry about that, I was having a really late night that night.

eugenepentland commented 1 year ago

It's all done now! You can check out my repo here: https://github.com/eugenepentland/landmark-attention-qlora

(image: evaluation results)

We trained a 7B and a 13B model, and the 13B model appears to have equal or better performance than the fully finetuned 7B base model. We tested each step 20 times. The majority of the work now is just properly evaluating the model beyond the test provided in the paper. All of the models can handle larger context than shown; we just ran out of memory on our GPU (still haven't tried the CPU offloading).

@mkrima I would love to get in contact with you guys to talk about your work and see if there is anything we can do to help. I already have a few people that are evaluating the models now and will be providing some feedback. I also have lots of questions about possible improvements for the future!

ethanhs commented 1 year ago

Hi! I am currently running MMLU (soon BBH and HumanEval) on the WizardLM-Landmark model. Will report back once I get numbers! I'm currently using https://github.com/declare-lab/instruct-eval, wired up to code based on @eugenepentland's qlora evaluation code. I have found that it runs much slower, which I guess is to be expected.

martinjaggi commented 1 year ago

have you switched yet to the new triton kernel which we posted?

Alignment-Lab-AI commented 1 year ago

> have you switched yet to the new triton kernel which we posted?

Our most recent progress hit a few roadblocks with regard to Triton. The Python version required for the Triton branch is currently incompatible with our AutoML system, so development from that direction is halted until that is solved. After switching to the new version, there has also been a roadblock with reshaping input ids; an issue mentioned a few days ago involved model_max_length and mem_freq.

Eugene and the other person primarily working on it have additionally been super busy for the last few days, though interest is ongoing.

eugenepentland commented 1 year ago

I haven't had any luck just trying to get your base landmark training on my local machine yet.

The first issue is that the block size is defined as 63 and the max_model_size is 512, but I get an error saying the max model size has to be divisible by the block size, so I dropped the max_model_size down to 504.

The issue I currently haven't been able to resolve is that when running the fused_landmark_attention function, I get the following error:

error: Number of elements must be power-of-two, but "tt.return"(%0) : (tensor<64x100xf32>) -> () doesn't follow the rule (6400) elements

So my query tensor has a head dimension of 100, which is making it fail. The only things I've changed are running it on a smaller dataset for the sake of faster testing (and switching to open_llama_3b so the RTX Quadro 8000 could run it without issues). Any help would be great.

python3 train.py \
    --model_name_or_path /home/toast/models/open_llama_3b \
    --output_dir /home/toast/models/open_llama_3b/output \
    --bf16 False \
    --cache_dir /home/toast/hf-cache/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0.1 \
    --model_max_length 1008 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --max_steps 100 \
    --use_flash True

mkrima commented 1 year ago

Regarding max_model_size: I understand this is a bit confusing, since max_model_size corresponds to the context length when use_flash is False, whereas when use_flash is True it corresponds to the number of non-landmark tokens (so, for example, in your settings the context size is actually 1008 + 1008/63 = 1024). I'll work on a patch so max_model_size will always be the context length, which should resolve the first issue, but in the meantime you are using the correct value.
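
A tiny illustrative helper (not from the repo) for that arithmetic:

```python
# With use_flash=True, max_model_size counts only regular tokens; one landmark
# token is inserted per block of `block_size` regular tokens.
def effective_context_length(max_model_size: int, block_size: int = 63) -> int:
    assert max_model_size % block_size == 0, "must be divisible by block size"
    return max_model_size + max_model_size // block_size

print(effective_context_length(1008))  # 1008 + 1008/63 = 1024
```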

Regarding the second issue: when using Triton, the head dimension needs to be a power of two. Can you test with a head dimension of 128?

mkrima commented 1 year ago

Since you are using a pretrained model, you probably cannot increase the head dimension directly. One solution is padding your key, query, and value vectors with zeros before passing them to the fused attention (which should not affect the output) and then dropping the additional dimensions after the attention is done.
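
A hedged PyTorch sketch of that workaround; the attention kernel is passed in as a plain callable here, since fused_landmark_attention's exact signature isn't reproduced in this thread:

```python
# Sketch: zero-pad q/k/v up to a power-of-two head dimension for the Triton
# kernel, then slice the output back to the original head dimension.
import torch.nn.functional as F

def attention_with_padded_head_dim(attn_fn, q, k, v, *args, target_dim=128, **kwargs):
    head_dim = q.shape[-1]                 # e.g. 100 for open_llama_3b
    pad = target_dim - head_dim
    q, k, v = (F.pad(t, (0, pad)) for t in (q, k, v))  # zero-pad the last dim
    out = attn_fn(q, k, v, *args, **kwargs)
    return out[..., :head_dim]             # drop the padded dimensions again
```

The zero columns contribute nothing to the query-key dot products, and the padded value dimensions come out as zeros and are sliced away, which is the intuition behind "should not affect the output" (assuming the kernel's softmax scaling uses the original head dimension or is passed explicitly).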

(By the way it's better to move this discussion to a new issue thread at some point since it's no longer about QLORA)

eugenepentland commented 1 year ago

I was able to fix that issue by padding my q, k, v tensors, but I'm just getting an OOM error from Triton now. I'll take a look into it later, but I'm not sure this is something I will be able to fix in any simple fashion. This is after I had already reduced num_stages to 1.

triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 81920, Hardware limit: 65536. Reducing block sizes or num_stages may help.

(And if I run into further issues I will open a new issue)

mkrima commented 1 year ago

This can be solved by lowering the block size to 31. Alternatively, if possible, using bf16 instead of tf32 should also fix it (possibly fp16 might also work).

eugenepentland commented 1 year ago

I got training running once I set the block size to 15 (I'm on an older GPU that doesn't support bf16/tf32). Also, the head dimension was only 100 for open_llama_3b; when I reran it on WizardLM 7B, the head dimension was already 128.

Training a 7B model used 25GB of VRAM with my QLoRA repo, but it still required 42GB of VRAM at the beginning of training because of Triton. I'll need to take a look into it further, but I'll push the updates to my repo later tonight.

292916808 commented 1 year ago

> Hi! I am currently running MMLU (soon BBH and HumanEval) on the WizardLM-Landmark model. Will report back once I get numbers! I'm currently using https://github.com/declare-lab/instruct-eval, wired up to code based on @eugenepentland's qlora evaluation code. I have found that it runs much slower, which I guess is to be expected.

@ethanhs Hi! Just wanted to follow up on this. Have you got any results for qlora on BBH?

ethanhs commented 1 year ago

Yeah, I'm pretty sure the results were significantly worse than the base model. I don't have the numbers anymore.