Open iramykytyn opened 2 weeks ago
Minimizing Randomness is described on Google Docs
When working with large language models (LLMs) like GPT, a major challenge is ensuring consistency in responses. These models are inherently stochastic, meaning randomness is embedded in their behavior. However, in certain applications, such as testing or content generation where consistency is crucial, it becomes essential to minimize this variability. In this article, we'll explore how to control randomness using parameters like seed and top_p, and why monitoring the system_fingerprint parameter is critical. We'll also discuss why achieving full determinism in LLM responses is ultimately impossible.
While setting a seed and controlling parameters like top_p can help reduce randomness, true determinism in LLM responses is impossible to achieve for several reasons:
Non-deterministic Model Behavior: Even when top_p is set to an extremely low value (e.g., top_p = 0.00000001), which effectively allows only the highest-ranking token to be selected, the model can still produce different outputs over time. This is due to non-determinism in the model's underlying computation. On longer text generations, even with such strict top_p settings, the highest-ranked token may occasionally switch because of the complex vector math that happens before token sampling. This behavior comes from the probabilistic nature of neural networks, meaning that token choice can vary despite tight constraints.
System Fingerprint Variability: The system_fingerprint parameter provides a unique identifier for the backend system serving the model. However, this fingerprint can change every time you call the model. When the fingerprint changes, it can indicate that the underlying API backend or model architecture has been updated. As a result, even with identical seeds and settings, slight changes in the model's internal mechanics can lead to different outputs. Thus, even with strong controls over randomness, true determinism remains out of reach.
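As a rough illustration of fingerprint monitoring, here is a minimal sketch assuming the official openai Python SDK, an OPENAI_API_KEY in the environment, and a placeholder model name; it simply records the system_fingerprint returned with each call so that backend changes can be separated from ordinary sampling noise:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, seed: int = 42):
    """Return the completion text and the fingerprint of the backend that served it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        seed=seed,
        temperature=0,
    )
    return response.choices[0].message.content, response.system_fingerprint


fingerprints = set()
for _ in range(5):
    text, fingerprint = ask("Name one prime number.")
    fingerprints.add(fingerprint)
    print(fingerprint, text)

# More than one distinct fingerprint suggests the backend changed between calls,
# so differing outputs may not be ordinary sampling randomness.
print("distinct fingerprints:", fingerprints)
```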
One of the most effective ways to reduce variability is by setting a seed. This makes it much more likely that, for a given input, the model produces the same output each time. While this helps create repeatability, it's important to remember that due to the non-deterministic nature of LLMs, even this won't guarantee perfect consistency across different API calls or versions of the model (especially if the system_fingerprint changes).
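To check how repeatable a fixed seed actually is, one option is to hash the completion text across several identical requests (again a sketch with the openai Python SDK and a placeholder model name):

```python
import hashlib

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def completion_hash(prompt: str, seed: int) -> str:
    """Hash the completion text so repeated calls are easy to compare."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        seed=seed,  # identical seed and parameters should usually repeat
        temperature=0,
    )
    return hashlib.sha256(response.choices[0].message.content.encode()).hexdigest()


hashes = {completion_hash("Summarize the water cycle in one sentence.", seed=123) for _ in range(3)}
# One distinct hash means the runs repeated exactly; more than one shows residual randomness.
print("distinct outputs:", len(hashes))
```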
The top_p parameter controls the probability distribution from which the model selects the next token. Setting top_p to a lower value restricts the model to a smaller pool of possible token choices, narrowing down the output variability.
However, lowering top_p too much can lead to unintended consequences:
Chain of Thought Resolution Issues: For complex tasks that require multi-step reasoning or explanations (e.g., chain-of-thought processes), setting top_p too low (e.g., top_p = 0.0001) can drastically degrade the model's ability to think through problems. The lack of diversity in token selection limits the model's ability to explore nuanced responses, leading to overly deterministic but shallow outputs.
Performance Deterioration: Lowering top_p to extreme levels can result in outputs that are either repetitive or overly simplistic, as the model is forced to choose from a very narrow set of tokens.
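In practice, a moderately low top_p (rather than an extreme value) is usually enough to make sampling nearly greedy without starving multi-step reasoning. A sketch, with the model name as a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A moderate top_p keeps sampling close to greedy while leaving some headroom
# for multi-step reasoning; extreme values such as 0.0001 tend to hurt it.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain step by step why 17 is prime."}],
    top_p=0.1,
    seed=42,
)
print(response.choices[0].message.content)
```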
Both temperature and top_p control the randomness of the model’s responses, but in different ways:
Temperature: This parameter controls the randomness in token selection by scaling the probability distribution. Lower values (e.g., temperature = 0.1) make the model more deterministic, while higher values (e.g., temperature = 1.0) increase diversity.
Top_p: This parameter limits the set of tokens considered for selection based on their cumulative probability.
According to the documentation of both GPT and Claude, using temperature and top_p together can lead to unpredictable behavior and poor performance. When both are set simultaneously, the combined effect may confuse the model, leading to degraded reasoning abilities and lower-quality outputs. For example, using both parameters might significantly hamper tasks that require detailed reasoning, like multi-step logic or chain of thought processes.
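Following that guidance, a request would adjust only one of the two, for example lowering temperature and leaving top_p at its default (a sketch; the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Adjust temperature alone and leave top_p at its default, rather than setting both.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "List three uses for a paperclip."}],
    temperature=0.1,  # low temperature -> more deterministic token selection
)
print(response.choices[0].message.content)
```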
When the n parameter was set so that the GPT-powered chat generated three responses for a single input, the behavior was as follows:
Example with temp = 0, n = 3:
Example with temp = 0.15, n = 3:
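The exact outputs are not reproduced here, but a comparison along these lines can be rerun with a sketch like the following (prompt and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request three completions for the same prompt in a single call.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Give a one-line definition of entropy."}],
    n=3,
    temperature=0,
)
texts = [choice.message.content for choice in response.choices]
# With temperature = 0 the three choices are often, but not always, identical.
print("identical:", len(set(texts)) == 1)
for text in texts:
    print("-", text)
```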
Unverified Theory: There is an unverified theory that using OpenRouter introduces slightly more randomness than calling OpenAI directly. This may be because OpenRouter routes requests through different providers and load-handling systems, whereas OpenAI manages its infrastructure somewhat differently. This could contribute to additional variability in responses when using OpenRouter.
Minimizing randomness in GPT responses is crucial in scenarios where consistency is needed, but true determinism remains elusive due to the non-deterministic nature of the model and variability in the system_fingerprint. By carefully controlling temperature, seeds, and top_p settings, you can reduce randomness, though you must be cautious about over-constraining the model, as it may degrade performance in complex tasks. Monitoring system_fingerprint changes is also key to distinguishing random variation from API backend updates, helping you manage LLM behavior more effectively.
The “seed” option for GPT does not increase the determinism level
ChatCompletions are not deterministic even with seed set, temperature=0, top_p=0, n=1
Left comments in the article, please consider updating it.
Try to minimize randomness in GPT responses with the seed or top_p parameters. Also try to monitor whether the system_fingerprint parameter changes and how often, because if it changes when the response changes, we are dealing with an updated API backend; the change in the response is then probably not random but an LLM change that we should handle differently. https://platform.openai.com/docs/api-reference/chat/create
See if there are similar parameters in the Claude API; a rough sketch is included below.
Report results in comments or create an article.
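For reference, a sketch of the closest Claude equivalent, assuming the anthropic Python SDK and a placeholder model name; the Messages API exposes temperature, top_p, and top_k, but as far as I can tell there is no seed or system_fingerprint equivalent:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# temperature, top_p, and top_k are available; a seed parameter is not (to my knowledge).
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=256,
    temperature=0,
    messages=[{"role": "user", "content": "Name one prime number."}],
)
print(response.content[0].text)
```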