Closed: Ssukriti closed this pull request 2 months ago.
Thanks for making a pull request! 😃 One of the maintainers will review and advise on the next steps.
I added a commit to add a unit test in test_merge_model_utils.py. Steps to run it (since GitHub Actions skips it due to not having CUDA):
1. The test uses a dummy tuned LoRA checkpoint from the fmaas-integration-tests COS bucket: DUMMY_TUNED_LLAMA_WITH_ADDED_TOKENS = "/testing/tuning/output/llama_dummy_LoRA/checkpoint-35"
2. Copy the test file to the pod: oc cp tests/utils/test_merge_model_utils.py angel-sleep-pod-sft-trainer:/home/tuning/.local/lib/python3.11/site-packages/tuning/utils/test_merge_model_utils.py
3. pip install pytest. Then run:
[tuning@angel-sleep-pod-sft-trainer app]$ cd /home/tuning/.local/lib/python3.11/site-packages/tuning/utils
[tuning@angel-sleep-pod-sft-trainer utils]$ pytest test_merge_model_utils.py
======================================= test session starts ========================================
platform linux -- Python 3.11.7, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/tuning/.local/lib/python3.11/site-packages/tuning/utils
collected 1 item
test_merge_model_utils.py . [100%]
======================================== 1 passed in 3.50s =========================================
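For reference, a test like this is normally gated on GPU availability, which is why GitHub Actions skips it. The skeleton below is only an illustrative sketch; the real test body is in this PR's tests/utils/test_merge_model_utils.py.

```python
# Illustrative sketch only; the actual test is in tests/utils/test_merge_model_utils.py.
import pytest
import torch

DUMMY_TUNED_LLAMA_WITH_ADDED_TOKENS = "/testing/tuning/output/llama_dummy_LoRA/checkpoint-35"

@pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="GitHub Actions runners have no CUDA, so this test is skipped there",
)
def test_post_process_lora_checkpoint(tmp_path):
    # Run the post-processing utility on the dummy LoRA checkpoint and assert
    # that new_embeddings.safetensors exists and lm_head.weight is no longer
    # present in adapter_model.safetensors.
    ...
```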
Note: these CI tests are failing because they run the old unit tests (see changes in the previous commit). The failing test is invoked with only 10 args; with the changes in this branch it should be running with 11 args.
Good news! llama3-8b, granite-3b-code-base, granite-34b, and mistral-7b all run inference successfully with the changes in this branch!
If you want to test for yourself:
llama model location: /fmaas-integration-tests/tuning/output/llama3-8b_pre_trained/lora/20240920143503-save_model/
mistral model location: /fmaas-integration-tests/tuning/output/mistral-7b-v0.1/lora/20240919184910-t1-save/
granite 3b (llama) model location: /fmaas-integration-tests/tuning/output/granite-3b-code-base/lora/testtest-save
granite 34b (gpt big code) model location: /fmaas-integration-tests/tuning/output/granite-34b-code-instruct/lora/twitter-20240920161824-save
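For a quick check, LoRA inference on vLLM can be exercised roughly as in the sketch below. The base model name, prompt, and serving setup are illustrative assumptions; only the adapter path comes from the list above, and the exact setup the team used may differ.

```python
# Rough sketch of verifying LoRA inference on vLLM; not the team's exact setup.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_lora=True)
outputs = llm.generate(
    ["Tweet text: I hate the new update. Label:"],  # illustrative prompt
    SamplingParams(max_tokens=20),
    lora_request=LoRARequest(
        "llama3-8b-lora",  # adapter name (arbitrary)
        1,                 # unique integer id for the adapter
        "/fmaas-integration-tests/tuning/output/llama3-8b_pre_trained/lora/20240920143503-save_model/",
    ),
)
print(outputs[0].outputs[0].text)
```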
Inference results for Llama, Mistral, Granite 3b, and Granite 34b (screenshots omitted).
@fabianlim @kmehant the PR is ready for a second review. I had to make some changes since you last reviewed, which came out of the team's testing (@Abhishek-TAMU @willmj). Describing the changes here:
I cannot rely on the file 'added_tokens.json', as it is not consistently produced for the newer PreTrainedTokenizerFast tokenizer types; only a few tokenizer types produce that file when the resize function is used. Hence I created my own artifact, 'added_tokens_info.json', containing the information we need for post-processing: how many tokens were added.
To know how many new tokens were added while tuning, the train() function now returns additional metadata containing that information. The utility function for post-processing consumes that metadata.
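As a rough sketch (the key and helper names here are illustrative, not the PR's exact code), the artifact and metadata look something like this:

```python
# Sketch only: records how many tokens were added during tuning so the
# post-processing step can read it back later.
import json
import os

def save_added_tokens_info(output_dir: str, num_added_tokens: int) -> dict:
    info = {"num_new_tokens": num_added_tokens}  # assumed key name
    with open(os.path.join(output_dir, "added_tokens_info.json"), "w") as f:
        json.dump(info, f)
    return info  # train() can return this as part of its additional metadata
```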
I could not call the post-processing from main(), as @Abhishek-TAMU and @willmj pointed out errors due to multi-processing: we launch the main script with accelerate, which uses multiple processes, and this post-processing must happen only once after training is complete. Hence I added a new step and script to be called after the accelerate launch tuning run finishes.
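In outline, the new step is a single-process entry point that runs once after the multi-process accelerate launch job has finished. The script and argument names below are illustrative assumptions, not the PR's actual interface:

```python
# Illustrative single-process entry point; the real script added by this PR may differ.
import argparse
import json
import os

def main():
    parser = argparse.ArgumentParser(description="Post-process a LoRA checkpoint for vLLM")
    parser.add_argument("--checkpoint-dir", required=True)
    args = parser.parse_args()

    # Read the artifact written during tuning (see added_tokens_info.json above).
    with open(os.path.join(args.checkpoint_dir, "added_tokens_info.json")) as f:
        num_new_tokens = json.load(f)["num_new_tokens"]

    # Invoke the post-processing utility exactly once, outside accelerate's workers
    # (see the safetensors sketch under "Related issue number" below).
    post_process_lora_for_vllm(args.checkpoint_dir, num_new_tokens)

if __name__ == "__main__":
    main()
```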
Next PR: to complete this contribution, we have to add the post-processing step to the build/accelerate_launch.py script behind a flag so watsonX can utilize it. This will be done next.
NOTE: While there may be ways to figure out which tokens were added at the end of tuning even without the 'added_tokens.json' file, it was turning out to be complicated, as some tokenizers just append new tokens to the existing 'special_tokens.json' file. Since we already have this information while tuning, I chose to write it to a file anyway.
@anhuong @willmj I have added sufficient unit tests pertaining to this change. Let's wait for @ashokponkumar to check the PR as well.
@ashokponkumar as explained on Slack, we cannot remove the addition of the pad token, as it is a requirement and well known in open source. Without it we run into this error: https://stackoverflow.com/questions/70544129/transformers-asking-to-pad-but-the-tokenizer-does-not-have-a-padding-token
Most open-source models do not have a pad token (the new llama3.1, llama3, llama, allam, mixtral, mistral). Hence we have to add at least one token for all these architectures, which we do in a generic manner: if the pad token is None, set it.
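Concretely, that generic handling boils down to the standard Hugging Face pattern (a sketch, not the exact code in this repo):

```python
# Standard pattern for models that ship without a pad token (sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # any of the architectures listed above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

if tokenizer.pad_token is None:
    # Even this single added token changes the vocab size, which is why the
    # post-processing in this PR is needed for LoRA inference on vLLM.
    tokenizer.add_special_tokens({"pad_token": "<PAD>"})
    model.resize_token_embeddings(len(tokenizer))
```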
This PR thus does the post-processing needed to handle the addition of any token for LoRA inference on vLLM. Without this PR, LoRA inference on vLLM does not work for any of the above architectures.
Even if we remove other tokens (like unk, etc.), which can be done in a follow-up PR and issue, the change is still needed for the pad token. Hence it is better to keep the change generic and return the number of added tokens from the code.
Description of the change
Adds a utility function to post-process a checkpoint after LoRA tuning and convert it to the format required by vLLM. It needs to be called at the end of LoRA tuning to allow LoRA inference for models to which we have added new tokens.
Since loading adapters.safetensors is fast enough, this was added as a post-processing function.
This PR adds a script that can be called after tuning to do the processing.
Related issue number
https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1210
Details on the vLLM issue: https://github.com/vllm-project/vllm/issues/2816#issuecomment-1934607503
Context: embedding vectors for new tokens need to be placed in new_embeddings.safetensors, and lm_head.weight should be deleted from adapter_model.safetensors, as per https://github.com/vllm-project/vllm/issues/2816#issuecomment-1934607503. A hedged sketch of that transformation follows.
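Assuming the safetensors library, and with illustrative tensor key names (both the adapter keys and the new_embeddings.safetensors keys below are assumptions, not confirmed by this PR), the transformation looks roughly like:

```python
# Hedged sketch; key names are assumptions, and the real implementation in this
# PR may handle more cases.
import os
from safetensors.torch import load_file, save_file

def post_process_lora_for_vllm(adapter_dir: str, num_new_tokens: int) -> None:
    adapter_path = os.path.join(adapter_dir, "adapter_model.safetensors")
    tensors = load_file(adapter_path)

    embed_key = "base_model.model.model.embed_tokens.weight"  # assumed key name
    lm_head_key = "base_model.model.lm_head.weight"           # assumed key name

    # Move the rows for the newly added tokens into new_embeddings.safetensors,
    # which is where vLLM expects them.
    new_embeddings = {}
    if embed_key in tensors:
        new_embeddings["input_embeddings"] = tensors[embed_key][-num_new_tokens:].contiguous()
    if lm_head_key in tensors:
        # lm_head.weight is also removed from adapter_model.safetensors here.
        new_embeddings["output_embeddings"] = tensors.pop(lm_head_key)[-num_new_tokens:].contiguous()
    save_file(new_embeddings, os.path.join(adapter_dir, "new_embeddings.safetensors"))

    # Rewrite the adapter file without lm_head.weight.
    save_file(tensors, adapter_path)
```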
How to verify the PR
Was the PR tested