Marker-Inc-Korea / RAGchain

Extension of Langchain for RAG. Easy benchmarking, multiple retrievals, reranker, time-aware RAG, and so on...
Apache License 2.0
275 stars, 28 forks

Refactoring answer metrics (BLEU, METEOR, ROUGE, etc.) #398

Open Eastsidegunn opened 8 months ago

Eastsidegunn commented 8 months ago

We could group our few metrics by category and make each one controllable with a few parameters (for example, for n-gram-based metrics, the user can choose n). That would be more flexible!

In my opinion,

Any more ideas?

vkehfdl1 commented 8 months ago

It would make the metrics very flexible, but I don't think it has to be implemented in RAGchain. There are two main reasons.

  1. Each metric probably has a conventional setting. We target RAG workflows and research on them, so there must be conventional hyperparameters or setups for each metric, because ideally all benchmarks should be run under the same settings. So it is better for us to suggest a conventional setup for each metric.
  2. Users can calculate scores with their own metrics after getting their results. We return the whole pd.DataFrame, which contains the question, answer, ground-truth answer, etc., so users can easily compute a score with their own metric from this DataFrame. If someone wants to score their results with a new metric, they can do that easily. (Maybe we can write a guide for that later. I did this once with the Rare F1 metric.)
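
As a rough illustration of the second point, here is a minimal sketch of scoring an evaluation result with a user-defined metric. The column names `question`, `answer`, and `gt_answer`, the helper `add_metric_column`, and the sample rows are all assumptions made for this example, not RAGchain's actual schema or API. The token-level F1 is only a stand-in for whatever metric the user wants (it is not the Rare F1 metric mentioned above).

```python
from collections import Counter

import pandas as pd


def token_f1(prediction: str, references: list[str]) -> float:
    """Best token-level F1 between a prediction and any reference answer."""
    pred_tokens = prediction.lower().split()
    best = 0.0
    for ref in references:
        ref_tokens = ref.lower().split()
        # Multiset intersection counts each shared token at most min(count) times.
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def add_metric_column(df: pd.DataFrame, name: str, metric) -> pd.DataFrame:
    """Return a copy of df with one score per row stored under column `name`.

    Hypothetical helper: assumes the columns "answer" and "gt_answer" exist.
    """
    scored = df.copy()
    scored[name] = [metric(a, gt) for a, gt in zip(df["answer"], df["gt_answer"])]
    return scored


# Stand-in for an evaluator result; the column names are assumptions.
result = pd.DataFrame({
    "question": ["Who wrote Hamlet?", "What is the capital of France?"],
    "answer": ["William Shakespeare", "It is Lyon"],
    "gt_answer": [["Shakespeare", "William Shakespeare"], ["Paris"]],
})

scored = add_metric_column(result, "token_f1", token_f1)
print(scored[["question", "token_f1"]])
print("mean token_f1:", scored["token_f1"].mean())
```

Because the metric is just a function of `(answer, gt_answer)`, swapping in a different one only means passing another callable.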

Plus, I think it would make our evaluator too complicated to use. Sometimes a framework should restrict flexibility for ease of use.

Eastsidegunn commented 8 months ago

Ok,

I think adding an EM (Exact Match) metric is one clearly defined step. (That is my conclusion after looking through many benchmarks.)
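
For reference, a minimal sketch of what an EM metric could look like. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) follow a convention commonly used in QA benchmarks; they are an assumption here, not something decided in this thread, and the function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


print(exact_match("The Eiffel Tower.", ["eiffel tower"]))
print(exact_match("Paris, France", ["Paris"]))
```

Whether to normalize at all, and how, is exactly the kind of setup choice that changes EM scores, so it would need to be fixed or documented.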

Actually, I can't identify a setup that I could suggest as the convention. Our BLEU and ROUGE scores wrap the official (maybe? at least commonly used) libraries, and these metrics have many variations depending on n or on the perspective. This problem clearly needs to be solved by our evaluator.
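
To make the "variation according to n or perspective" concrete, here is a small sketch of an n-gram overlap score where both `n` and the perspective (precision as in BLEU's building block, recall as in ROUGE-N) are parameters. This is not full BLEU or ROUGE (no brevity penalty, no averaging over several n, no stemming), and the function names and the `mode` parameter are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list; empty if the list is shorter than n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(prediction: str, reference: str, n: int = 1,
                  mode: str = "precision") -> float:
    """Clipped n-gram overlap, divided by prediction (precision) or reference (recall) n-grams."""
    if mode not in ("precision", "recall"):
        raise ValueError("mode must be 'precision' or 'recall'")
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram count to the smaller of the two sides.
    overlap = sum((pred & ref).values())
    total = sum(pred.values()) if mode == "precision" else sum(ref.values())
    return overlap / total if total else 0.0


pred = "the cat sat on the mat"
ref = "the cat is on the mat"
for n in (1, 2):
    print(n, ngram_overlap(pred, ref, n=n), ngram_overlap(pred, ref, n=n, mode="recall"))
```

The same pair of sentences gets a different score for each choice of `n`, which is why a benchmark needs either a fixed convention or an explicit parameter.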

In my view, the evaluator should expose the following functions

How about a metric_expanded-version? Like metric.py, metric_expanded-version would be a collection of various metrics that are not officially(?) accepted. The name is tentative.