alebjanes opened 4 months ago
Fine-tune MAPE results:
Measuring the median absolute % error on value qty, and taking the 1.1B-param model trained for one epoch as the baseline (ft_tiny0), the error decreased with the 50-epoch model (ft_tiny2) and increased when scaling up the model size (to 7B params, ft_llama2). That sort of makes sense: larger models are harder to fine-tune.
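For reference, the metric above (median absolute % error) can be sketched like this; the data here is made up, not the actual fine-tune outputs:

```python
# Median absolute percentage error: median of |pred - true| / |true| * 100.
# Hypothetical values, purely to illustrate the metric.
def median_ape(y_true, y_pred):
    errors = sorted(abs(p - t) / abs(t) * 100 for t, p in zip(y_true, y_pred))
    n = len(errors)
    mid = n // 2
    # For an even count, average the two middle errors.
    return errors[mid] if n % 2 else (errors[mid - 1] + errors[mid]) / 2

truth = [100.0, 250.0, 80.0]   # ground-truth value qty
preds = [110.0, 240.0, 100.0]  # model predictions
print(median_ape(truth, preds))  # 10.0
```

Using the median rather than the mean keeps the occasional wildly wrong quantity from dominating the comparison between checkpoints.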
Fine-tuning TinyLlama to produce API calls instead, accuracy is 12%, mainly because it struggles with HS code numbers. Setting those aside, query accuracy is 89%. On the HS numbers the mean % error is about 12%, but again, every time the model is queried with the same question it returns a slightly different number.
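The split between query accuracy and HS-code error could be scored with something like the sketch below. The call format and field names are hypothetical, not the real API; the idea is just to compare the call with numbers masked out, then measure % error on the numeric fields separately:

```python
import re

def evaluate_call(expected: str, generated: str):
    """Return (query_ok, mean % error on numeric fields or None)."""
    # Query accuracy: the call with every number masked must match exactly.
    query_ok = re.sub(r"\d+", "#", expected) == re.sub(r"\d+", "#", generated)
    exp_nums = re.findall(r"\d+", expected)
    gen_nums = re.findall(r"\d+", generated)
    if query_ok and exp_nums and len(exp_nums) == len(gen_nums):
        errs = [abs(int(g) - int(e)) / int(e) * 100
                for e, g in zip(exp_nums, gen_nums)]
        code_err = sum(errs) / len(errs)
    else:
        code_err = None
    return query_ok, code_err

# Hypothetical call format: right query template, slightly wrong HS code.
ok, err = evaluate_call("get_trade(hs=8471, year=2020)",
                        "get_trade(hs=8470, year=2020)")
```

Scoring this way makes the failure mode visible: a model can get the query template right most of the time while still drifting on the HS digits.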
RAG Evaluation
Evaluated on 100 questions. Types of questions:
RAG evaluation results
Best combination tested so far: multi-qa-mpnet-base-cos-v1 (embeddings) + gpt-3.5-turbo (LLM)
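The retrieval side of that combination boils down to cosine-similarity ranking over the question and document embeddings. A minimal sketch, with toy vectors standing in for the actual multi-qa-mpnet-base-cos-v1 embeddings (and gpt-3.5-turbo then answering from the top hits, omitted here):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(sims)[::-1][:k]  # best matches first

# Toy 3-d embeddings; real ones are 768-d vectors from the encoder.
docs = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.9]])
query = np.array([1.0, 0.0, 0.1])
print(top_k(query, docs))  # [0 2]
```

Since the model name ends in cos-v1, it was trained for cosine similarity, so ranking this way (rather than by dot product) is the intended use.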
Out of the wrong answers:
We're preparing a presentation gathering the results of all approaches in more detail. Next week I'll be improving the RAG + LLM and evaluating the previous multi-layer approach with Pippo.