finos / ai-readiness

Our goal is to mutually develop a governance framework that manages the onboarding, development, and running of AI-based solutions within financial services organisations, allowing us all to unlock the potential of this disruptive technology in a safe, trustworthy and compliant way.
https://air-governance-framework.finos.org/
Creative Commons Zero v1.0 Universal

[Request Feedback]: TR-6, Non-deterministic behaviour #34

Closed: vicenteherrera closed this issue 1 month ago

vicenteherrera commented 2 months ago

Contact Details

vicente.herrera@control-plane.io

What is the idea

This is a request for direct feedback and answers to the following question regarding a threat described in the governance framework.

TR-6, Non-deterministic behaviour
Section: governance-framework/_threats/tr-6.md
File (private repo link): https://github.com/finos/ai-readiness-private/blob/10e31ea7ccf3893983404de3484f7c57f9934d57/governance-framework/_threats/tr-6.md
Diff (private repo link): https://github.com/finos/ai-readiness-private/pull/5/files#diff-87d41e3ec96687ff49345546307c878da1b45b272916021f5446d007c681abbe

Title: Non-deterministic behaviour
Type: Integrity
External references:

Description: Given the immaturity of the products, the vector store may not have capabilities expected of enterprise software (access control, encryption at rest, audit logging, etc.). Misconfiguration may allow unauthorized access to data. An internal user accesses the data and leaks or tampers with it.

Question:

Why this is important

Discussion surrounding this threat took place live in today's meeting (2024-09-03).

We have opened this issue to better capture everyone's feedback, without being constrained by the time allotted to the meeting.


i8labs commented 2 months ago

To address the concern around the non-determinism of LLMs, one approach is to develop a continuous testing module that regularly checks the correctness of the model's output by establishing a set of benchmark question-answer pairs. Here's how you could approach this:

1. Test Module Setup:
    - Question Bank: Create a repository of test questions that represent different domains, difficulty levels, and edge cases relevant to the LLM’s use cases.
    - Expected Answers: For each question, establish an expected answer that is considered correct or within an acceptable range.
    - Threshold Definition: Since LLMs may provide slightly different wordings or nuances, define a correctness threshold. This could involve keyword matching, similarity scoring (e.g., cosine similarity between embedding vectors), or pre-defined acceptable variations of answers (see the first sketch after this list).
2. Regular Testing:
    - Scheduled Tests: The module should run scheduled tests (e.g., daily or hourly) in which the LLM generates answers for the question bank.
    - Comparison to Expected Outputs: Each generated response is compared to the expected answer using a mix of lexical comparison (exact or partial matching), semantic comparison (vector-based), and human-in-the-loop validation when required.
3. Non-determinism Handling:
    - Multi-Run Validation: Execute the same question multiple times across different time intervals to evaluate consistency in the LLM’s responses (see the second sketch after this list).
    - Response Variation Logging: Log the variance between responses. If the differences exceed an acceptable range or lead to incorrect outcomes, flag the issue for further review.
4. Adaptive Testing:
    - Dynamic Question Bank: Periodically update the question bank based on newly identified risks, patterns of failure, or feedback from end users.
    - Error Learning: If incorrect results are returned, the module should allow feedback loops to adjust or inform model retraining.
5. Error Analysis and Reporting:
    - Metrics Dashboard: Maintain a dashboard that tracks correctness, response variation, and the percentage of acceptable answers over time.
    - Alerts: Set thresholds for model accuracy, and trigger alerts if these thresholds are breached (see the third sketch after this list).

This testing module will act as a guardrail, continuously monitoring for errors and non-deterministic behavior while providing real-time feedback on the model's correctness. This process ensures that any potential degradation in performance is detected early and addressed promptly.
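To make the threshold definition and comparison steps concrete, here is a minimal sketch of the benchmark check, assuming the `sentence-transformers` package for embeddings. The `query_llm` function, the question bank contents, and the 0.85 similarity threshold are all illustrative placeholders, not recommendations.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical stand-in for the LLM under test; wire this to your model endpoint.
def query_llm(question: str) -> str:
    raise NotImplementedError("replace with a real model/API call")

# Illustrative benchmark question-answer pairs.
QUESTION_BANK = [
    ("What is the settlement cycle for US equities?",
     "US equities settle on a T+1 cycle."),
]

SIMILARITY_THRESHOLD = 0.85  # illustrative; tune per use case

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def run_benchmark() -> list[dict]:
    """Compare each generated answer to its expected answer by cosine similarity."""
    results = []
    for question, expected in QUESTION_BANK:
        answer = query_llm(question)
        score = util.cos_sim(
            embedder.encode(answer, convert_to_tensor=True),
            embedder.encode(expected, convert_to_tensor=True),
        ).item()
        results.append({
            "question": question,
            "answer": answer,
            "score": score,
            "passed": score >= SIMILARITY_THRESHOLD,
        })
    return results
```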
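For the multi-run validation in step 3, one possible shape is to ask the same question several times and flag it when the minimum pairwise similarity of the responses drops below a consistency floor. The run count and the 0.90 floor are assumptions for illustration; `query_llm` and `embedder` are reused from the previous sketch.

```python
from itertools import combinations
from sentence_transformers import util

CONSISTENCY_FLOOR = 0.90  # illustrative minimum pairwise similarity
N_RUNS = 5                # illustrative number of repeated runs

def check_consistency(question: str) -> dict:
    """Run the same question N times and measure pairwise response similarity."""
    responses = [query_llm(question) for _ in range(N_RUNS)]
    embeddings = embedder.encode(responses, convert_to_tensor=True)
    pairwise = [
        util.cos_sim(embeddings[i], embeddings[j]).item()
        for i, j in combinations(range(N_RUNS), 2)
    ]
    worst = min(pairwise)
    return {
        "question": question,
        "min_pairwise_similarity": worst,
        "flagged": worst < CONSISTENCY_FLOOR,  # flag for human review
        "responses": responses,
    }
```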
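And for the alerting in step 5, a trivial hook over the benchmark results might look like the following; the 0.95 accuracy threshold is a placeholder, and the log call stands in for a real alerting channel.

```python
import logging

logger = logging.getLogger("llm-benchmark")

ACCURACY_THRESHOLD = 0.95  # illustrative alert threshold

def report(results: list[dict]) -> None:
    """Compute the pass rate for a benchmark run and alert on threshold breaches."""
    accuracy = sum(r["passed"] for r in results) / len(results)
    logger.info("benchmark accuracy: %.1f%%", 100 * accuracy)
    if accuracy < ACCURACY_THRESHOLD:
        # Replace with a real alerting channel (pager, chat webhook, etc.).
        logger.error("accuracy %.1f%% is below threshold %.1f%%",
                     100 * accuracy, 100 * ACCURACY_THRESHOLD)
```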