Open svenseeberg opened 2 months ago
Possible training data with questions about Integreat content: https://huggingface.co/datasets/digitalfabrik/integreat-qa The questions are relatively simple and well phrased, so they only cover a subset of the cases mentioned above.
Tests are based on commit 9f57f80f68222daf3ac1ce088f727d8b00d92797 (llama3.1:8b, skip questions with no matching documents, chunking at h2 tags).
I need to know the German language for a job. What do I need to do?
This does not always yield a result: in roughly 1 of 4 cases the message is not classified as a question that requires an answer.
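The 1-in-4 inconsistency could be quantified by re-running the classification step repeatedly on the same message. A minimal sketch (the classifier here is a hypothetical stand-in for the actual LLM call, flipping randomly to mimic the observed behavior):

```python
import random

def question_rate(classify, message, runs=20):
    """Fraction of runs in which the classifier treats the message as a
    question that requires an answer."""
    return sum(1 for _ in range(runs) if classify(message)) / runs

# Hypothetical stand-in for the LLM classification call; it fails roughly
# 25% of the time to mimic the observed rate (not the real model).
_rng = random.Random(0)
def flaky_classifier(message):
    return _rng.random() > 0.25

rate = question_rate(
    flaky_classifier,
    "I need to know the German language for a job. What do I need to do?",
)
```

Running this against the real classification endpoint would give a concrete consistency number per benchmark question.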
Another interesting prompt:
Is there a cinema in Munich that shows English movies?
```json
{
  "answer": "I don't know. The provided context does not mention cinemas or movie showings in Munich.",
  "sources": [
    "/muenchen/en/culture-leisure-sport/general-information/",
    "/muenchen/en/culture-leisure-sport/be-creative/youth-theatre-workshop-in-the-bellevue-di-monaco/",
    "/muenchen/en/culture-leisure-sport/meet-people/meetings-in-the-neighbourhood/"
  ],
  "details": [
    {
      "source": "/muenchen/en/culture-leisure-sport/be-creative/youth-theatre-workshop-in-the-bellevue-di-monaco/",
      "score": 0.7928134202957153
    },
    {
      "source": "/muenchen/en/culture-leisure-sport/general-information/",
      "score": 0.855070948600769
    },
    {
      "source": "/muenchen/en/culture-leisure-sport/meet-people/meetings-in-the-neighbourhood/",
      "score": 1.0023198127746582
    }
  ],
  "status": "success"
}
```
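For interpreting the `details` above: assuming the scores are Milvus L2 distances (lower means a closer match, which matches how `sources` appear to be ordered here), the retrieved chunks can be ranked like this:

```python
# The "details" list from the response above (assumption: scores are
# Milvus L2 distances, so a lower score means a closer match).
details = [
    {"source": "/muenchen/en/culture-leisure-sport/be-creative/youth-theatre-workshop-in-the-bellevue-di-monaco/", "score": 0.7928134202957153},
    {"source": "/muenchen/en/culture-leisure-sport/general-information/", "score": 0.855070948600769},
    {"source": "/muenchen/en/culture-leisure-sport/meet-people/meetings-in-the-neighbourhood/", "score": 1.0023198127746582},
]

def rank_sources(details):
    """Order retrieved chunks from closest to farthest match."""
    return [d["source"] for d in sorted(details, key=lambda d: d["score"])]

ranked = rank_sources(details)
```

Note that even the best-scoring chunk is about a theatre workshop, not cinemas, which is consistent with the "I don't know" answer.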
Another test question with frequent bad results:
Hi I'm from Afghanistan and 17 years old. How can I learn German?
We tried to get more consistent document results from Milvus (see #60) with flat indexes but still got varying results. The only remaining explanation is that the embedding model produces different vectors for the same query.
*edit: see https://github.com/digitalfabrik/integreat-chat/issues/61#issuecomment-2431861775
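One way to confirm or rule out embedding non-determinism is to embed the identical query several times and compare the vectors directly, before Milvus is involved. A minimal sketch (`toy_embed` is a placeholder so the snippet is runnable; in practice this would be the production embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_deterministic(embed, query, runs=5, tol=1e-6):
    """Embed the same query several times and check the vectors stay put."""
    ref = embed(query)
    return all(cosine(ref, embed(query)) > 1 - tol for _ in range(runs))

# Toy deterministic embedding, used only to make the sketch runnable.
def toy_embed(text):
    return [float(ord(c)) for c in text] or [1.0]
```

If this check fails for the real model, the varying retrieval results are explained upstream of the index.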
Another observation: the chunking (and chunk encoding) might be problematic as well.
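For reference, splitting a page at h2 tags can be sketched roughly like this (a naive regex split for illustration; the actual pipeline may well use a proper HTML parser, and the example page content is made up):

```python
import re

def chunk_at_h2(html):
    """Split page HTML into chunks at <h2> boundaries.

    The lookahead keeps each <h2> tag at the start of its own chunk; any
    content before the first <h2> becomes a leading chunk of its own.
    """
    parts = re.split(r"(?=<h2[ >])", html)
    return [p.strip() for p in parts if p.strip()]

page = "<h1>Learning German</h1><p>Intro</p><h2>Courses</h2><p>A</p><h2>Exams</h2><p>B</p>"
chunks = chunk_at_h2(page)
```

A potential problem with this scheme is that a chunk loses the context of its parent headings, which may hurt both embedding quality and answer generation.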
We want to do performance testing on our different modules: three of the above-mentioned components should be held fixed while we vary one of them and test different approaches with our benchmark questions.
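A sketch of that one-factor-at-a-time setup (the component names and alternatives below are illustrative placeholders, not the actual configuration):

```python
# Illustrative baseline; component names and values are assumptions.
baseline = {
    "chunking": "h2",
    "embedding": "default",
    "index": "flat",
    "llm": "llama3.1:8b",
}

# Hypothetical alternatives to try for some components.
alternatives = {
    "chunking": ["h3", "fixed-512"],
    "llm": ["llama3.1:70b"],
}

def ofat_configs(baseline, alternatives):
    """Yield configs that change exactly one component from the baseline."""
    for component, options in alternatives.items():
        for option in options:
            cfg = dict(baseline)
            cfg[component] = option
            yield cfg

configs = list(ofat_configs(baseline, alternatives))
```

Each generated config would then be run against the full benchmark question set, so results are attributable to the single changed component.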
Benchmark questions in order of their priority and based on our user stories:
Extended Benchmark questions based on Persona "Iryna"
Extended benchmark questions not based on personas: