VrushaliJoshi-v37040 opened 1 week ago
Could you provide your deployment config? Trying to help here. Logs will also help
I used a serving.properties file with the following configuration:

```
engine=MPI
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.model_id={{model_id}}
option.dtype=fp16
option.tgi_compat=true
option.rolling_batch=lmi-dist
```
My endpoint config is very simple:

```python
{
    "VariantName": "variant1",
    "ModelName": model_name,
    "InstanceType": "ml.g5.24xlarge",
    "InitialInstanceCount": 1,
    "ModelDataDownloadTimeoutInSeconds": 3600,
    "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
}
```

Also, please note that I am not facing any errors while deploying; the deployment is successful, but the output format is different. Expected output according to the DJL documentation for the TGI-compatible output feature:

```json
[
    { "generated_text": "Deep Learning is a really cool field" }
]
```
What I am getting instead:

```json
{ "generated_text": "Deep Learning is a really cool field" }
```
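Until the format difference is explained, one workaround is to normalize the response on the client side so downstream code always sees the documented list-of-dicts shape. A minimal sketch (the helper name `normalize_tgi_output` is mine, not a DJL API):

```python
import json


def normalize_tgi_output(body):
    """Return the TGI-style list-of-dicts shape regardless of which of the
    two observed response formats the endpoint produced."""
    payload = json.loads(body) if isinstance(body, (str, bytes)) else body
    # With option.tgi_compat=true the documented shape is a list of dicts;
    # the endpoint above returns a bare dict, so wrap it when needed.
    if isinstance(payload, dict):
        return [payload]
    return payload


# Both observed shapes normalize to the documented one.
observed = {"generated_text": "Deep Learning is a really cool field"}
documented = [{"generated_text": "Deep Learning is a really cool field"}]
assert normalize_tgi_output(observed) == documented
assert normalize_tgi_output(json.dumps(documented)) == documented
```

This is only a client-side shim; it does not answer why `option.tgi_compat=true` is not producing the documented shape on the server.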
Also, the output quality degraded significantly with the DJL container compared to the TGI container.
Could you share a sample prompt you use and the parameters? And the expected output, if possible?
I have mentioned the sample prompt in the issue description. Posting it again for reference:

```
[INST] <<SYS>>
Respond only with the answer and do not provide any explanation or additional text. If you don't know the answer to a question, please answer with 'I dont know'.Answer should be as short as possible.
<</SYS>>
Below context is text extracted from a medical document. Answer the question asked based on the context given.
Context: {text}
Question: {question} [/INST]
```
Expected output if the question is "What is patient name?" — model response: `[{'generated_text': 'John H'}]`
I am using a fine-tuned model which is trained on the above-mentioned format of prompt and answer.
I have a fine-tuned Llama 2 7B chat model which I am deploying to an endpoint using the DJL container. After deploying, when I tested the model, the output quality had degraded (the model seems to echo the same answer for some of the questions asked).
Before using the DJL container I was using the TGI container, and the model was working absolutely fine. I understand there could be differences in how the two containers run inference, but is there a way to override the inference code? Following is the sample prompt that I am using to prompt the model:

```
[INST] <<SYS>>
Respond only with the answer and do not provide any explanation or additional text. If you don't know the answer to a question, please answer with 'I dont know'.Answer should be as short as possible.
<</SYS>>
Below context is text extracted from a medical document. Answer the question asked based on the context given.
Context: {text}
Question: {question} [/INST]
```
The model is fine-tuned on the above prompt format, so we need to run inference in a way that lets it comprehend this format and give the answer.
Any resources/suggestions would be really helpful.
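For reference, a minimal sketch of how the template can be filled and the endpoint invoked with explicit generation parameters, since differing default sampling settings between containers is one possible cause of quality changes. The endpoint name, parameter values, and helper names are placeholders, and the `<<SYS>>` tags follow the standard Llama 2 chat format (the issue text shows them stripped by markdown):

```python
import json

# Template the model was fine-tuned on, with the standard Llama 2 chat
# markers (<<SYS>> tags assumed; they render as "<>" in the issue text).
PROMPT_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "Respond only with the answer and do not provide any explanation or "
    "additional text. If you don't know the answer to a question, please "
    "answer with 'I dont know'.Answer should be as short as possible.\n"
    "<</SYS>>\n"
    "Below context is text extracted from a medical document. Answer the "
    "question asked based on the context given.\n"
    "Context: {text}\n"
    "Question: {question} [/INST]"
)


def build_prompt(text, question):
    """Fill the fine-tuning template exactly as the model saw it in training."""
    return PROMPT_TEMPLATE.format(text=text, question=question)


def query_endpoint(endpoint_name, text, question):
    """Invoke the SageMaker endpoint, pinning generation parameters instead
    of relying on container defaults (parameter values are placeholders)."""
    import boto3  # imported lazily so build_prompt stays usable offline

    payload = {
        "inputs": build_prompt(text, question),
        "parameters": {"max_new_tokens": 64, "do_sample": False},
    }
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())
```

Comparing outputs from the two containers with identical pinned parameters would help separate a prompt-handling difference from a sampling-defaults difference.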