citadel-ai / langcheck

Simple, Pythonic building blocks to evaluate LLM applications.
https://langcheck.readthedocs.io/en/latest/index.html
MIT License

Add Prometheus Eval Client #122

Closed: conan1024hao closed this 3 months ago

conan1024hao commented 3 months ago

Adds the Prometheus Eval Client

conan1024hao commented 3 months ago

@liwii A small question: I noticed that "" and '' are mixed in the code. Is there a policy on which to use?

For example, '' is used in https://github.com/citadel-ai/langcheck/blob/c55681b08c33d6e6779a3b07f47ce86a0cc549cb/src/langcheck/metrics/eval_clients/_anthropic.py#L150, but "" is used in https://github.com/citadel-ai/langcheck/blob/c55681b08c33d6e6779a3b07f47ce86a0cc549cb/src/langcheck/metrics/eval_clients/_anthropic.py#L110.

liwii commented 3 months ago

Ah yeah it is just inconsistent haha

Our linter & formatter don't handle quote style properly right now, but we'll fix that altogether in #125, so you don't need to worry about it in this PR!

conan1024hao commented 3 months ago

Note: if we want to ensure the outputs are parsable by retrying with multiple samples, we may also need to adjust the temperature settings (at temperature 0 every retry would produce the same output, so retrying only helps when sampling is non-deterministic). Since generating the text response (get_text_responses()) and extracting the score (get_float_score()) are separate functions, it might be necessary to redefine get_score() along these lines:

    # Intended as a method on the eval client; `Iterable` comes from
    # `collections.abc`.
    def get_score(
        self,
        metric_name: str,
        language: str,
        prompts: str | Iterable[str],
        score_map: dict[str, float],
        *,
        intermediate_tqdm_description: str | None = None,
        score_tqdm_description: str | None = None
    ) -> tuple[list[float | None], list[str | None]]:
        if isinstance(prompts, str):
            prompts = [prompts]
        unstructured_assessment_results: list[str | None] = []
        scores: list[float | None] = []
        for prompt in prompts:
            attempts = 0
            score = None
            unstructured_assessment_result = [None]
            # Retry until a parsable score is obtained or max_attempts is
            # reached.
            while attempts < self._max_attempts and score is None:
                unstructured_assessment_result = self.get_text_responses(
                    [prompt], tqdm_description=intermediate_tqdm_description)
                score = self.get_float_score(
                    metric_name,
                    language,
                    unstructured_assessment_result,
                    score_map,
                    tqdm_description=score_tqdm_description)[0]
                attempts += 1
            # Append inside the loop so each prompt gets exactly one entry.
            unstructured_assessment_results.append(
                unstructured_assessment_result[0])
            scores.append(score)
        # Return the plural list, not the last single-prompt result.
        return scores, unstructured_assessment_results
conan1024hao commented 3 months ago

@liwii Hi, I merged https://github.com/citadel-ai/langcheck/pull/126 into this one.

Please take a look!