deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution

Llama 2 7b chat model output quality is low #2093

Open VrushaliJoshi-v37040 opened 1 week ago

VrushaliJoshi-v37040 commented 1 week ago

I have a fine-tuned Llama 2 7B chat model which I am deploying to an endpoint using a DJL container. After deploying, when I tested the model, the output quality had degraded (the model seems to echo the same answer for some of the questions asked).

Before using the DJL container, I was using the TGI container and the model was working absolutely fine. I understand there could be differences in how the two containers run inference, but is there a way of overriding the inference code? The following is the sample prompt that I am using to prompt the model:

"[INST] <> Respond only with the answer and do not provide any explanation or additional text. If you don't know the answer to a question, please answer with 'I dont know'. Answer should be as short as possible. <> Below context is text extracted from a medical document. Answer the question asked based on the context given. Context: {text} Question: {question} [/INST]"

The model is fine-tuned on the above prompt format, so we need to run inference in a way that preserves this format so the model comprehends it and gives the answer.
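For reference, a minimal sketch of how I invoke the endpoint (the endpoint name, the generation parameter values, and the `inputs`/`parameters` payload shape are placeholders/assumptions, not copied from my actual client code):

```python
import json
import boto3

# Sketch only: endpoint name and generation parameters below are assumed values.
runtime = boto3.client("sagemaker-runtime")

prompt = (
    "[INST] <> Respond only with the answer and do not provide any explanation "
    "or additional text. If you don't know the answer to a question, please answer "
    "with 'I dont know'. Answer should be as short as possible. <> "
    "Below context is text extracted from a medical document. Answer the question "
    "asked based on the context given. "
    "Context: {text} Question: {question} [/INST]"
).format(text="...", question="What is patient name?")

payload = {
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 64,   # assumed value
        "do_sample": False,     # greedy decoding, assumed for comparison with TGI
    },
}

response = runtime.invoke_endpoint(
    EndpointName="my-llama2-endpoint",  # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```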

Any resources/suggestions would be really helpful.

lanking520 commented 1 week ago

Could you provide your deployment config? Trying to help here. Logs will also help

VrushaliJoshi-v37040 commented 5 days ago

I used a serving.properties file with the following configuration:

engine=MPI
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.model_id={{model_id}}
option.dtype=fp16
option.tgi_compat=true
option.rolling_batch=lmi-dist

My endpoint config is very simple:

{
    "VariantName": "variant1",
    "ModelName": model_name,
    "InstanceType": "ml.g5.24xlarge",
    "InitialInstanceCount": 1,
    "ModelDataDownloadTimeoutInSeconds": 3600,
    "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
}

Also, please note that I am not facing any errors while deploying; the deployment is successful, but the output format is different. Expected output according to the DJL documentation for the TGI-compatible output feature:

[ { "generated_text": "Deep Learning is a really cool field" } ]
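For context, here is roughly how that variant config is wired up when creating the endpoint (a sketch only; the model name, role ARN, image URI, S3 path, and endpoint names below are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: the real role ARN, LMI container image URI, and model artifact
# location are not shown in this issue.
model_name = "llama2-7b-chat-finetuned"
sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "<djl-lmi-container-image-uri>",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
)

# The production variant below mirrors the endpoint config quoted above.
sm.create_endpoint_config(
    EndpointConfigName="llama2-7b-chat-config",
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.24xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        }
    ],
)

sm.create_endpoint(
    EndpointName="llama2-7b-chat-endpoint",
    EndpointConfigName="llama2-7b-chat-config",
)
```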

What I am getting instead:

{ "generated_text": "Deep Learning is a really cool field" }
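As a temporary client-side workaround (a sketch only, based on the two response shapes shown above), I normalize the output like this:

```python
def extract_generated_text(parsed):
    """Normalize the two response shapes seen in this issue.

    TGI-compatible output: [{"generated_text": "..."}]
    Output currently returned: {"generated_text": "..."}
    """
    if isinstance(parsed, list):
        parsed = parsed[0]
    return parsed["generated_text"]

# Example with both shapes:
print(extract_generated_text([{"generated_text": "Deep Learning is a really cool field"}]))
print(extract_generated_text({"generated_text": "Deep Learning is a really cool field"}))
```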

Also, the quality of the output degraded significantly with the DJL container as compared to the TGI container.

lanking520 commented 4 days ago

Could you share a sample prompt you use and the parameters? And the expected output, if possible?

VrushaliJoshi-v37040 commented 4 days ago

I have mentioned the sample prompt in the issue description. Mentioning it below again for reference:

"""[INST] <> Respond only with the answer and do not provide any explanation or additional text. If you don't know the answer to a question, please answer with 'I dont know'. Answer should be as short as possible. <> Below context is text extracted from a medical document. Answer the question asked based on the context given. Context: {text} Question: {question} [/INST]"""

Expected output if the question is: What is patient name? Model response: [{'generated_text': 'John H'}]

I am using a fine-tuned model which is trained on the above-mentioned prompt and answer format.