foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

fix: utilities to post process checkpoint for LoRA #338

Closed Ssukriti closed 2 months ago

Ssukriti commented 2 months ago

Description of the change

Adds a utility function to post-process a checkpoint after LoRA tuning and convert it to the format required by vLLM. This needs to be called at the end of LoRA tuning to allow LoRA inference for models to which we have added new tokens.

Since loading adapters.safetensors is fast enough, it was added as a post-processing function.

This PR adds a script that can be called after tuning to do the processing.

Related issue number

https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1210
Details on the vLLM issue: https://github.com/vllm-project/vllm/issues/2816#issuecomment-1934607503

Context: Embedding vectors for new tokens need to be placed in new_embeddings.safetensors and lm_head.weight should be deleted from adapter_model.safetensors as per https://github.com/vllm-project/vllm/issues/2816#issuecomment-1934607503
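For illustration, here is a minimal sketch of that post-processing with the safetensors library; the tensor key names and output layout are assumptions based on the linked vLLM issue, not necessarily the exact code added in this PR:

```python
import os

from safetensors.torch import load_file, save_file


def postprocess_lora_for_vllm(adapter_dir: str, num_added_tokens: int) -> None:
    """Move new-token embedding rows out of adapter_model.safetensors and drop lm_head."""
    adapter_path = os.path.join(adapter_dir, "adapter_model.safetensors")
    tensors = load_file(adapter_path)

    new_embeddings = {}

    # Copy the embedding rows of the newly added tokens into the file vLLM expects.
    embed_key = "base_model.model.model.embed_tokens.weight"  # assumed key name
    if num_added_tokens > 0 and embed_key in tensors:
        new_embeddings["input_embeddings"] = tensors[embed_key][-num_added_tokens:].clone()

    # Remove lm_head.weight from the adapter file, keeping only its new-token rows.
    lm_head_key = "base_model.model.lm_head.weight"  # assumed key name
    if lm_head_key in tensors:
        lm_head = tensors.pop(lm_head_key)
        if num_added_tokens > 0:
            new_embeddings["output_embeddings"] = lm_head[-num_added_tokens:].clone()

    # Write the trimmed adapter back and save the new-token embeddings separately.
    save_file(tensors, adapter_path)
    if new_embeddings:
        save_file(new_embeddings, os.path.join(adapter_dir, "new_embeddings.safetensors"))
```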

How to verify the PR

  1. The function was called on a LoRA-tuned Llama model; the post-processed checkpoint is being tested on vLLM.
  2. Verified the PR with every unique model architecture we support.

Was the PR tested

github-actions[bot] commented 2 months ago

Thanks for making a pull request! 😃 One of the maintainers will review and advise on the next steps.

aluu317 commented 2 months ago

I added a commit to add a unit test in test_merge_model_utils.py. Steps to run it (since GitHub Actions skips it due to not having CUDA):

test_merge_model_utils.py . [100%]

======================================== 1 passed in 3.50s =========================================

willmj commented 2 months ago

Note these tests are failing because they are running the old unit tests (see changes in the previous commit). The failing run passes only 10 args; it should be running with 11 args (see attached screenshots).

willmj commented 2 months ago

Good news! llama3-8b, granite-3b-code-base, granite-34b, and mistral-7b all run inference with the changes in this branch! If you want to test for yourself:

llama model location: /fmaas-integration-tests/tuning/output/llama3-8b_pre_trained/lora/20240920143503-save_model/
mistral model location: /fmaas-integration-tests/tuning/output/mistral-7b-v0.1/lora/20240919184910-t1-save/
granite 3b (llama) model location: /fmaas-integration-tests/tuning/output/granite-3b-code-base/lora/testtest-save
granite 34b (gpt big code) model location: /fmaas-integration-tests/tuning/output/granite-34b-code-instruct/lora/twitter-20240920161824-save

[screenshots of inference output for Llama, Mistral, Granite 3b, and Granite 34b]

Ssukriti commented 2 months ago

@fabianlim @kmehant the PR is ready for a second review. I had to make some changes since you last reviewed, which came out of the team's testing (@Abhishek-TAMU @willmj). Describing the changes here:

  1. I cannot rely on the file 'added_tokens.json', as it is not consistently produced for the new PretrainedTokenizerFast tokenizer types. Only a few tokenizer types result in that file being produced by the resize function. Hence I created my own artifact, 'added_tokens_info.json', containing the information we need for post-processing: how many tokens were added (see the sketch after this list).

  2. To know how many new tokens were added during tuning, the train() function now returns additional metadata containing that information. The post-processing utility consumes that metadata.

  3. I could not call the post-processing from main(), as @Abhishek-TAMU and @willmj pointed out errors due to multi-processing: we launch the main script with accelerate, which uses multiple processes, and this post-processing needs to happen only once after training is complete. Hence I added a new step and script to be called after the accelerate launch tuning run is complete.
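A minimal sketch of the artifact and metadata flow from points 1 and 2; the JSON schema and helper names here are illustrative assumptions, not the exact code in this PR:

```python
import json
import os


def write_added_tokens_info(output_dir: str, num_added_tokens: int) -> None:
    # Written once at the end of train(), alongside the tuned checkpoint.
    with open(os.path.join(output_dir, "added_tokens_info.json"), "w") as f:
        json.dump({"num_added_tokens": num_added_tokens}, f)


def read_added_tokens_info(checkpoint_dir: str) -> int:
    # Consumed by the post-processing script to know how many embedding rows
    # belong in new_embeddings.safetensors.
    with open(os.path.join(checkpoint_dir, "added_tokens_info.json")) as f:
        return json.load(f)["num_added_tokens"]
```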

Next PR: to complete this contribution, we have to add the post-processing step in the build/accelerate_launch.py script under a flag so watsonX can utilize it. This will be done next.

NOTE: While there may be ways to figure out which tokens were added at the end of tuning even without the 'added_tokens.json' file, it was turning out to be complicated, as some tokenizers just append new tokens to the existing 'special_tokens.json' file. Since we already have this information while tuning, I chose to write it to a file anyway.

Ssukriti commented 2 months ago

@anhuong @willmj I have added sufficient unit tests pertaining to this change. Let's wait for @ashokponkumar to check the PR as well.

Ssukriti commented 2 months ago

@ashokponkumar as explained on Slack, we cannot remove the addition of the pad token, as it is a requirement and well known in open source. Without it we will run into this error: https://stackoverflow.com/questions/70544129/transformers-asking-to-pad-but-the-tokenizer-does-not-have-a-padding-token

Most open-source models do not have a pad token: the new llama3.1, llama3, llama, allam, mixtral, mistral. Hence we have to add a minimum of 1 token for all these architectures, which we do in a generic manner: 'if pad token is None, set it' (see the sketch below).
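As a sketch of this generic handling with standard HuggingFace APIs (the model name and pad-token string below are placeholders, not necessarily what this repo uses):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example of an architecture without a pad token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added_tokens = 0
if tokenizer.pad_token is None:
    # Adds exactly one special token for architectures that ship without a pad token.
    num_added_tokens = tokenizer.add_special_tokens({"pad_token": "<PAD>"})
    model.resize_token_embeddings(len(tokenizer))

# num_added_tokens is the count the LoRA post-processing utility needs later.
```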

This PR is thus doing the post-processing needed to handle the addition of any token for LoRA inference on vLLM. Without this PR, LoRA inference on vLLM does not work for any of the above architectures.

Even if we remove other tokens, like unk etc. (which can be done in a following PR and issue), the change is still needed for the pad token. Hence it is better to keep the change generic and return the number of added tokens from the code.