Description
Robustness testing evaluates a model's ability to maintain consistent performance when the input is perturbed or modified. For LLMs, this means understanding how changes in capitalization, punctuation, typos, contractions, and contextual information affect prediction performance.
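For example, here is a minimal sketch of the kinds of perturbations involved (the helper below is illustrative, not the project's API):

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Simulate typos by randomly swapping adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "What is the capital of France?"
perturbed = [
    original.upper(),           # capitalization change
    original.replace("?", ""),  # punctuation removal
    add_typos(original),        # typos
]
```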
We use a two-layer method to compare the expected_result against the actual_result:
- Layer 1: Check whether the expected_result and actual_result are the same by comparing them directly. However, this approach runs into trouble when weak LLMs fail to answer in line with the given prompt, leading to inaccuracies.
- Layer 2: If the direct comparison in Layer 1 proves inadequate, we move to Layer 2, which offers three alternative evaluation options: String distance, Embedding distance, or using an LLM as the evaluator.
This dual-layered approach enhances the robustness of our evaluation metric, allowing for adaptability in scenarios where direct comparisons may fall short.
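As a rough sketch of this flow (the function name, normalization, and threshold below are illustrative assumptions, not the actual implementation), the Layer 1 check with the String distance fallback could look like:

```python
from difflib import SequenceMatcher

def compare_results(expected_result: str, actual_result: str,
                    threshold: float = 0.9) -> bool:
    """Hypothetical sketch of the two-layer comparison."""
    # Layer 1: direct comparison after light normalization.
    if expected_result.strip().lower() == actual_result.strip().lower():
        return True
    # Layer 2 (String distance option): fuzzy match as a fallback.
    # SequenceMatcher.ratio() returns a similarity score in [0, 1];
    # Embedding distance or an LLM evaluator could be swapped in here.
    ratio = SequenceMatcher(None, expected_result.strip().lower(),
                            actual_result.strip().lower()).ratio()
    return ratio >= threshold
```

Under the Embedding distance option, the ratio would be replaced by a similarity between sentence embeddings of the two answers; under the LLM option, a judge model is asked whether the two answers agree.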
➤ Fixes # (issue)
Type of change
Please delete options that are not relevant.
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
Usage
Checklist:
- [ ] I've added Google style docstrings to my code.
- [ ] I've used `pydantic` for typing when/where necessary.
Screenshots (if appropriate):