foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

fix: utilities to post process checkpoint for LoRA #338

Closed Ssukriti closed 2 months ago

Ssukriti commented 2 months ago

Description of the change

Adds a utility function to post-process a checkpoint after LoRA tuning and convert it to the format required by vLLM. This needs to be called at the end of LoRA tuning to allow LoRA inference for models to which we have added new tokens.

Since loading adapters.safetensors is fast enough, it was added as a post-processing function.

This PR adds a script that can be called after tuning to do the processing.

Related issue number

https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1210
Details on the vLLM issue: https://github.com/vllm-project/vllm/issues/2816#issuecomment-1934607503

Context: Embedding vectors for new tokens need to be placed in new_embeddings.safetensors and lm_head.weight should be deleted from adapter_model.safetensors as per https://github.com/vllm-project/vllm/issues/2816#issuecomment-1934607503
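For illustration, here is a minimal sketch of that post-processing with the safetensors library; the tensor key names and output layout are assumptions based on the linked vLLM issue, not necessarily the exact code added in this PR:

```python
import os

from safetensors.torch import load_file, save_file


def postprocess_lora_for_vllm(adapter_dir: str, num_added_tokens: int) -> None:
    """Move new-token embedding rows out of adapter_model.safetensors and drop lm_head."""
    adapter_path = os.path.join(adapter_dir, "adapter_model.safetensors")
    tensors = load_file(adapter_path)

    new_embeddings = {}

    # Copy the embedding rows of the newly added tokens into the file vLLM expects.
    embed_key = "base_model.model.model.embed_tokens.weight"  # assumed key name
    if num_added_tokens > 0 and embed_key in tensors:
        new_embeddings["input_embeddings"] = tensors[embed_key][-num_added_tokens:].clone()

    # Remove lm_head.weight from the adapter file, keeping only its new-token rows.
    lm_head_key = "base_model.model.lm_head.weight"  # assumed key name
    if lm_head_key in tensors:
        lm_head = tensors.pop(lm_head_key)
        if num_added_tokens > 0:
            new_embeddings["output_embeddings"] = lm_head[-num_added_tokens:].clone()

    # Write the trimmed adapter back and save the new-token embeddings separately.
    save_file(tensors, adapter_path)
    if new_embeddings:
        save_file(new_embeddings, os.path.join(adapter_dir, "new_embeddings.safetensors"))
```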

How to verify the PR

  1. The function was called on a LoRA-tuned Llama model; the post-processed checkpoint is being tested on vLLM.
  2. Verified the PR with every unique model architecture we support.

Was the PR tested

github-actions[bot] commented 2 months ago

Thanks for making a pull request! 😃 One of the maintainers will review and advise on the next steps.

aluu317 commented 2 months ago

I added a commit to add a unit test in test_merge_model_utils.py. Steps to run it (since GitHub Actions skips it due to not having CUDA):

test_merge_model_utils.py . [100%]

======================================== 1 passed in 3.50s =========================================

willmj commented 2 months ago

Note these tests are failing because they are running the old unit tests (see changes in the previous commit). The failing run passes only 10 args; it should be running with 11 args (see attached screenshots).

willmj commented 2 months ago

Good news! llama3-8b, granite-3b-code-base, granite-34b, and mistral-7b all run inference with the changes in this branch! If you want to test for yourself:

llama model location: /fmaas-integration-tests/tuning/output/llama3-8b_pre_trained/lora/20240920143503-save_model/
mistral model location: /fmaas-integration-tests/tuning/output/mistral-7b-v0.1/lora/20240919184910-t1-save/
granite 3b (llama) model location: /fmaas-integration-tests/tuning/output/granite-3b-code-base/lora/testtest-save
granite 34b (gpt big code) model location: /fmaas-integration-tests/tuning/output/granite-34b-code-instruct/lora/twitter-20240920161824-save

[screenshots of inference output for Llama, Mistral, Granite 3b, and Granite 34b]

Ssukriti commented 2 months ago

@fabianlim @kmehant the PR is ready for a second review. I had to make some changes since you last reviewed, which came out of the team's testing (@Abhishek-TAMU @willmj). Describing the changes here:

  1. I cannot rely on the file 'added_tokens.json', as it is not consistently produced for the new PretrainedTokenizerFast tokenizer types. Only a few tokenizer types result in that file being produced by the resize function. Hence I created my own artifact, 'added_tokens_info.json', containing the information we need for post-processing: how many tokens were added (see the sketch after this list).

  2. To know how many new tokens were added during tuning, the train() function now returns additional metadata containing that information. The post-processing utility consumes that metadata.

  3. I could not call the post-processing from main(), as @Abhishek-TAMU and @willmj pointed out errors due to multi-processing: we launch the main script with accelerate, which uses multiple processes, and this post-processing needs to happen only once after training is complete. Hence I added a new step and script to be called after the accelerate launch tuning run is complete.
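A minimal sketch of the artifact and metadata flow from points 1 and 2; the JSON schema and helper names here are illustrative assumptions, not the exact code in this PR:

```python
import json
import os


def write_added_tokens_info(output_dir: str, num_added_tokens: int) -> None:
    # Written once at the end of train(), alongside the tuned checkpoint.
    with open(os.path.join(output_dir, "added_tokens_info.json"), "w") as f:
        json.dump({"num_added_tokens": num_added_tokens}, f)


def read_added_tokens_info(checkpoint_dir: str) -> int:
    # Consumed by the post-processing script to know how many embedding rows
    # belong in new_embeddings.safetensors.
    with open(os.path.join(checkpoint_dir, "added_tokens_info.json")) as f:
        return json.load(f)["num_added_tokens"]
```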

Next PR: to complete this contribution, we have to add the post-processing step in the build/accelerate_launch.py script under a flag so watsonX can utilize it. This will be done next.

NOTE: While there may be ways to figure out which tokens were added at the end of tuning even without the 'added_tokens.json' file, it was turning out to be complicated, as some tokenizers just append new tokens to the existing 'special_tokens.json' file. Since we already have this information while tuning, I chose to write it to a file anyway.

Ssukriti commented 2 months ago

@anhuong @willmj I have added sufficient unit tests pertaining to this change. Let's wait for @ashokponkumar to check the PR as well.

Ssukriti commented 2 months ago

@ashokponkumar as explained on Slack, we cannot remove the addition of the pad token, as it is a requirement and well known in open source. Without it we will run into this error: https://stackoverflow.com/questions/70544129/transformers-asking-to-pad-but-the-tokenizer-does-not-have-a-padding-token

Most open-source models do not have a pad token: the new llama3.1, llama3, llama, allam, mixtral, mistral. Hence we have to add a minimum of 1 token for all these architectures, which we do in a generic manner: 'if pad token is None, set it' (see the sketch below).
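As a sketch of this generic handling with standard HuggingFace APIs (the model name and pad-token string below are placeholders, not necessarily what this repo uses):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example of an architecture without a pad token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added_tokens = 0
if tokenizer.pad_token is None:
    # Adds exactly one special token for architectures that ship without a pad token.
    num_added_tokens = tokenizer.add_special_tokens({"pad_token": "<PAD>"})
    model.resize_token_embeddings(len(tokenizer))

# num_added_tokens is the count the LoRA post-processing utility needs later.
```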

This PR is thus doing the post-processing needed to handle the addition of any token for LoRA inference on vLLM. Without this PR, LoRA inference on vLLM does not work for any of the above architectures.

Even if we remove other tokens, like unk etc. (which can be done in a following PR and issue), the change is still needed for the pad token. Hence it is better to keep the change generic and return the number of added tokens from the code.