[EPIC] First Version of Wren AI Service Evaluation Framework

Context

To deliver a generative AI project successfully, a robust evaluation system is essential. Without a useful evaluation system, we can't easily tell how well or badly our system performs, and the evaluation process can't be automated. (Even with an evaluation system, a human in the loop is still needed, but having one substantially reduces the human effort required.) If you are curious about the topic, we've learned a lot from the community, and we hope the resources in the References section help you grasp the concepts behind building one.

The evaluation framework is purpose-built for WrenAI, and more AI pipelines will be added over time. For the first version, however, we'll focus on the most important and most widely used AI pipeline: the ask pipeline, which is essentially the text-to-SQL task.

Goal

Make it easier for non-technical users to curate evaluation datasets
Define metrics for the retrieval and generation stages, as well as for the end-to-end pipeline
Define component-level metrics, so that we can optimize each part of the pipeline independently and in a more scalable way
Save pipeline traces to make future debugging and refinement easier
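
To make the stage-level metrics above more concrete, here is a minimal sketch of how retrieval and generation metrics for a text-to-SQL sample might be computed. It is not the WrenAI implementation: the `EvalSample` fields, the function names, and the choice of table-level precision/recall and result-set matching are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalSample:
    """One curated evaluation example (hypothetical schema, for illustration only)."""
    question: str
    expected_tables: set[str]     # tables the reference SQL needs
    retrieved_tables: set[str]    # tables returned by the retrieval stage
    expected_rows: list[tuple]    # rows from running the reference SQL
    predicted_rows: list[tuple]   # rows from running the generated SQL


def retrieval_precision(s: EvalSample) -> float:
    """Fraction of retrieved tables that were actually needed."""
    if not s.retrieved_tables:
        return 0.0
    return len(s.expected_tables & s.retrieved_tables) / len(s.retrieved_tables)


def retrieval_recall(s: EvalSample) -> float:
    """Fraction of needed tables that were retrieved."""
    if not s.expected_tables:
        return 1.0
    return len(s.expected_tables & s.retrieved_tables) / len(s.expected_tables)


def execution_match(s: EvalSample) -> float:
    """1.0 if both queries return the same rows, ignoring row order."""
    return float(Counter(s.expected_rows) == Counter(s.predicted_rows))


def evaluate(samples: list[EvalSample]) -> dict[str, float]:
    """Average each metric over the dataset (assumes a non-empty list)."""
    n = len(samples)
    return {
        "retrieval_precision": sum(retrieval_precision(s) for s in samples) / n,
        "retrieval_recall": sum(retrieval_recall(s) for s in samples) / n,
        "execution_match": sum(execution_match(s) for s in samples) / n,
    }
```

Keeping the retrieval and generation scores separate is what allows each stage to be tuned on its own, while `execution_match` serves as a rough end-to-end signal for the whole pipeline.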

Tasks

[x] #356
[x] #357
[x] #358
[x] #359
[x] #360
