alebjanes opened 4 months ago
Fine-tune MAPE results:
Measuring the median absolute % error on value qty, and taking the 1.1B-param model trained for one epoch as the baseline (ft_tiny0), the error decreased with the 50-epoch model (ft_tiny2) and increased when scaling up the model size (to 7B params, ft_llama2). That sort of makes sense: larger models are harder to fine-tune.
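For reference, the metric above (median absolute % error) can be sketched like this; the data here is made up, not the actual fine-tune outputs:

```python
# Median absolute percentage error: median of |pred - true| / |true| * 100.
# Hypothetical values, purely to illustrate the metric.
def median_ape(y_true, y_pred):
    errors = sorted(abs(p - t) / abs(t) * 100 for t, p in zip(y_true, y_pred))
    n = len(errors)
    mid = n // 2
    # For an even count, average the two middle errors.
    return errors[mid] if n % 2 else (errors[mid - 1] + errors[mid]) / 2

truth = [100.0, 250.0, 80.0]   # ground-truth value qty
preds = [110.0, 240.0, 100.0]  # model predictions
print(median_ape(truth, preds))  # 10.0
```

Using the median rather than the mean keeps the occasional wildly wrong quantity from dominating the comparison between checkpoints.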
Fine-tuning TinyLlama to produce API calls instead, accuracy is 12%, mainly because it struggles with HS code numbers. Setting those aside, query accuracy is 89%. On the HS numbers the mean % error is about 12%, but again, every time the model is queried with the same question it returns a slightly different number.
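The split between query accuracy and HS-code error could be scored with something like the sketch below. The call format and field names are hypothetical, not the real API; the idea is just to compare the call with numbers masked out, then measure % error on the numeric fields separately:

```python
import re

def evaluate_call(expected: str, generated: str):
    """Return (query_ok, mean % error on numeric fields or None)."""
    # Query accuracy: the call with every number masked must match exactly.
    query_ok = re.sub(r"\d+", "#", expected) == re.sub(r"\d+", "#", generated)
    exp_nums = re.findall(r"\d+", expected)
    gen_nums = re.findall(r"\d+", generated)
    if query_ok and exp_nums and len(exp_nums) == len(gen_nums):
        errs = [abs(int(g) - int(e)) / int(e) * 100
                for e, g in zip(exp_nums, gen_nums)]
        code_err = sum(errs) / len(errs)
    else:
        code_err = None
    return query_ok, code_err

# Hypothetical call format: right query template, slightly wrong HS code.
ok, err = evaluate_call("get_trade(hs=8471, year=2020)",
                        "get_trade(hs=8470, year=2020)")
```

Scoring this way makes the failure mode visible: a model can get the query template right most of the time while still drifting on the HS digits.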
RAG Evaluation
Evaluated on 100 questions. Types of questions:
RAG evaluation results
Best combination tested so far: multi-qa-mpnet-base-cos-v1 (embeddings) + gpt-3.5-turbo (LLM)
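The retrieval side of that combination boils down to cosine-similarity ranking over the question and document embeddings. A minimal sketch, with toy vectors standing in for the actual multi-qa-mpnet-base-cos-v1 embeddings (and gpt-3.5-turbo then answering from the top hits, omitted here):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(sims)[::-1][:k]  # best matches first

# Toy 3-d embeddings; real ones are 768-d vectors from the encoder.
docs = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.9]])
query = np.array([1.0, 0.0, 0.1])
print(top_k(query, docs))  # [0 2]
```

Since the model name ends in cos-v1, it was trained for cosine similarity, so ranking this way (rather than by dot product) is the intended use.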
Out of the wrong answers:
We're preparing a presentation gathering the results of all approaches in more detail. Next week I'll be improving the RAG + LLM and evaluating the previous multi-layer approach with Pippo.