Open iramykytyn opened 2 weeks ago
Minimizing Randomness is described on Google Docs
When working with large language models (LLMs) like GPT, a major challenge is ensuring consistency in responses. These models are inherently stochastic, meaning randomness is embedded in their behavior. However, in certain applications, such as testing or content generation where consistency is crucial, it becomes essential to minimize this variability. In this article, we'll explore how to control randomness using parameters like seed and top_p, and why monitoring the system_fingerprint parameter is critical. We'll also discuss why achieving full determinism in LLM responses is ultimately impossible.
While setting a seed and controlling parameters like top_p can help reduce randomness, true determinism in LLM responses is impossible to achieve for several reasons:
Non-deterministic Model Behavior: Even when top_p is set to an extremely low value (e.g., top_p = 0.00000001), which effectively allows only the highest-ranking token to be selected, the model can still produce different outputs over time. This is due to non-determinism in the model's underlying computation. On longer text generations, even with such strict top_p settings, the highest-ranked token may occasionally switch because of the complex vector math that happens before token sampling. This behavior comes from the probabilistic nature of neural networks, meaning that token choice can vary despite tight constraints.
System Fingerprint Variability: The system_fingerprint parameter provides a unique identifier for the backend system serving the model. However, this fingerprint can change every time you call the model. When the fingerprint changes, it can indicate that the underlying API backend or model architecture has been updated. As a result, even with identical seeds and settings, slight changes in the model's internal mechanics can lead to different outputs. Thus, even with strong controls over randomness, true determinism remains out of reach.
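As a rough illustration of fingerprint monitoring, here is a minimal sketch assuming the official openai Python SDK, an OPENAI_API_KEY in the environment, and a placeholder model name; it simply records the system_fingerprint returned with each call so that backend changes can be separated from ordinary sampling noise:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, seed: int = 42):
    """Return the completion text and the fingerprint of the backend that served it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        seed=seed,
        temperature=0,
    )
    return response.choices[0].message.content, response.system_fingerprint


fingerprints = set()
for _ in range(5):
    text, fingerprint = ask("Name one prime number.")
    fingerprints.add(fingerprint)
    print(fingerprint, text)

# More than one distinct fingerprint suggests the backend changed between calls,
# so differing outputs may not be ordinary sampling randomness.
print("distinct fingerprints:", fingerprints)
```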
One of the most effective ways to reduce variability is by setting a seed. This makes it much more likely that, for a given input, the model produces the same output each time. While this helps create repeatability, it's important to remember that due to the non-deterministic nature of LLMs, even this won't guarantee perfect consistency across different API calls or versions of the model (especially if the system_fingerprint changes).
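To check how repeatable a fixed seed actually is, one option is to hash the completion text across several identical requests (again a sketch with the openai Python SDK and a placeholder model name):

```python
import hashlib

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def completion_hash(prompt: str, seed: int) -> str:
    """Hash the completion text so repeated calls are easy to compare."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        seed=seed,  # identical seed and parameters should usually repeat
        temperature=0,
    )
    return hashlib.sha256(response.choices[0].message.content.encode()).hexdigest()


hashes = {completion_hash("Summarize the water cycle in one sentence.", seed=123) for _ in range(3)}
# One distinct hash means the runs repeated exactly; more than one shows residual randomness.
print("distinct outputs:", len(hashes))
```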
The top_p parameter controls the probability distribution from which the model selects the next token. Setting top_p to a lower value restricts the model to a smaller pool of possible token choices, narrowing down the output variability.
However, lowering top_p too much can lead to unintended consequences:
Chain of Thought Resolution Issues: For complex tasks that require multi-step reasoning or explanations (e.g., chain-of-thought processes), setting top_p too low (e.g., top_p = 0.0001) can drastically degrade the model's ability to think through problems. The lack of diversity in token selection limits the model's ability to explore nuanced responses, leading to overly deterministic but shallow outputs.
Performance Deterioration: Lowering top_p to extreme levels can result in outputs that are either repetitive or overly simplistic, as the model is forced to choose from a very narrow set of tokens.
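In practice, a moderately low top_p (rather than an extreme value) is usually enough to make sampling nearly greedy without starving multi-step reasoning. A sketch, with the model name as a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A moderate top_p keeps sampling close to greedy while leaving some headroom
# for multi-step reasoning; extreme values such as 0.0001 tend to hurt it.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain step by step why 17 is prime."}],
    top_p=0.1,
    seed=42,
)
print(response.choices[0].message.content)
```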
Both temperature and top_p control the randomness of the model’s responses, but in different ways:
Temperature: This parameter controls the randomness in token selection by scaling the probability distribution. Lower values (e.g., temperature = 0.1) make the model more deterministic, while higher values (e.g., temperature = 1.0) increase diversity.
Top_p: This parameter limits the set of tokens considered for selection based on their cumulative probability.
According to the documentation of both GPT and Claude, using temperature and top_p together can lead to unpredictable behavior and poor performance. When both are set simultaneously, the combined effect may confuse the model, leading to degraded reasoning abilities and lower-quality outputs. For example, using both parameters might significantly hamper tasks that require detailed reasoning, like multi-step logic or chain of thought processes.
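Following that guidance, a request would adjust only one of the two, for example lowering temperature and leaving top_p at its default (a sketch; the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Adjust temperature alone and leave top_p at its default, rather than setting both.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "List three uses for a paperclip."}],
    temperature=0.1,  # low temperature -> more deterministic token selection
)
print(response.choices[0].message.content)
```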
When the n parameter was set so that the GPT-powered chat generated three responses for a single input, the behavior was as follows:
Example with temp = 0, n = 3:
Example with temp = 0.15, n = 3:
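The exact outputs are not reproduced here, but a comparison along these lines can be rerun with a sketch like the following (prompt and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request three completions for the same prompt in a single call.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Give a one-line definition of entropy."}],
    n=3,
    temperature=0,
)
texts = [choice.message.content for choice in response.choices]
# With temperature = 0 the three choices are often, but not always, identical.
print("identical:", len(set(texts)) == 1)
for text in texts:
    print("-", text)
```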
Unverified Theory: There is an unverified theory that using OpenRouter introduces slightly more randomness than calling OpenAI directly. This may be because OpenRouter routes requests through different providers and load-handling systems, whereas OpenAI manages its infrastructure somewhat differently. This could contribute to additional variability in responses when using OpenRouter.
Minimizing randomness in GPT responses is crucial in scenarios where consistency is needed, but true determinism remains elusive due to the non-deterministic nature of the model and variability in the system_fingerprint. By carefully controlling temperature, seeds, and top_p settings, you can reduce randomness, though you must be cautious about over-constraining the model, as it may degrade performance in complex tasks. Monitoring system_fingerprint changes is also key to distinguishing random variation from API backend updates, helping you manage LLM behavior more effectively.
The “seed” option for GPT does not increase the determinism level
ChatCompletions are not deterministic even with seed set, temperature=0, top_p=0, n=1
Left comments in the article, please consider updating it.
Try to minimize randomness in GPT responses with the seed or top_p parameters. Also try to monitor whether the system_fingerprint parameter changes and how often, because if it changes when the response changes, we are dealing with an updated API backend; the change in the response is then probably not random but an LLM change that we should handle differently. https://platform.openai.com/docs/api-reference/chat/create
See if there are similar parameters in the Claude API; a rough sketch is included below.
Report results in comments or create an article.
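For reference, a sketch of the closest Claude equivalent, assuming the anthropic Python SDK and a placeholder model name; the Messages API exposes temperature, top_p, and top_k, but as far as I can tell there is no seed or system_fingerprint equivalent:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# temperature, top_p, and top_k are available; a seed parameter is not (to my knowledge).
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=256,
    temperature=0,
    messages=[{"role": "user", "content": "Name one prime number."}],
)
print(response.content[0].text)
```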