Description
Robustness testing evaluates a model's ability to maintain consistent performance when the input is perturbed or modified. For LLMs, this means understanding how changes in capitalization, punctuation, typos, contractions, and contextual information affect prediction performance.
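For example, here is a minimal sketch of the kinds of perturbations involved (the helper below is illustrative, not the project's API):

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Simulate typos by randomly swapping adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "What is the capital of France?"
perturbed = [
    original.upper(),           # capitalization change
    original.replace("?", ""),  # punctuation removal
    add_typos(original),        # typos
]
```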
We use a two-layer method to compare the expected_result against the actual_result:
- Layer 1: Check whether the expected_result and actual_result are the same by comparing them directly. However, this approach runs into trouble when weak LLMs fail to answer in line with the given prompt, leading to inaccuracies.
- Layer 2: If the direct comparison in Layer 1 proves inadequate, we move to Layer 2, which offers three alternative evaluation options: String distance, Embedding distance, or using an LLM as the evaluator.
This dual-layered approach enhances the robustness of our evaluation metric, allowing for adaptability in scenarios where direct comparisons may fall short.
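As a rough sketch of this flow (the function name, normalization, and threshold below are illustrative assumptions, not the actual implementation), the Layer 1 check with the String distance fallback could look like:

```python
from difflib import SequenceMatcher

def compare_results(expected_result: str, actual_result: str,
                    threshold: float = 0.9) -> bool:
    """Hypothetical sketch of the two-layer comparison."""
    # Layer 1: direct comparison after light normalization.
    if expected_result.strip().lower() == actual_result.strip().lower():
        return True
    # Layer 2 (String distance option): fuzzy match as a fallback.
    # SequenceMatcher.ratio() returns a similarity score in [0, 1];
    # Embedding distance or an LLM evaluator could be swapped in here.
    ratio = SequenceMatcher(None, expected_result.strip().lower(),
                            actual_result.strip().lower()).ratio()
    return ratio >= threshold
```

Under the Embedding distance option, the ratio would be replaced by a similarity between sentence embeddings of the two answers; under the LLM option, a judge model is asked whether the two answers agree.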
➤ Fixes # (issue)
Type of change
Please delete options that are not relevant.
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
Usage
Checklist:
- [ ] I've added Google style docstrings to my code.
- [ ] I've used `pydantic` for typing when/where necessary.
Screenshots (if appropriate):