RFC: Implement Resilient Error Handling for API Overload Conditions
Background
Currently, the sampling loop is vulnerable to API overload conditions (error 529), which can result in frequent failures and noisy error reporting. This impacts both reliability and user experience.
Problem
When encountering error 529 (overloaded_error), the current implementation:
Fails immediately without retrying
Generates noisy error logs
Doesn't implement any backoff strategy
May contribute to the overload condition through immediate retries
Proposed Solution
Implement a resilient error handling strategy with:
Exponential Backoff
Start with 1 second base delay
Exponential increase (1s, 2s, 4s, 8s, 16s)
Maximum of 5 retries
Cap maximum delay at 5 minutes
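For reference, a minimal sketch of this delay calculation; the helper name backoff_delay is illustrative only and not part of the existing code:

import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 300.0) -> float:
    # Exponential growth: 1s, 2s, 4s, 8s, 16s for attempts 0-4, capped at 5 minutes
    delay = min(max_delay, base_delay * (2 ** attempt))
    # Up to 10% jitter so concurrent clients do not retry in lockstep
    return delay + random.uniform(0, 0.1 * delay)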
Smart Retry Logic
import random
import time

from anthropic import InternalServerError

# client and messages are defined by the surrounding sampling loop
max_retries = 5
base_delay = 1  # Start with a 1 second delay

for attempt in range(max_retries):
    try:
        raw_response = client.beta.messages.with_raw_response.create(...)
        break  # Success, exit retry loop
    except InternalServerError as e:
        if "overloaded_error" not in str(e):
            raise  # Other internal server errors are not retried here
        if attempt == max_retries - 1:
            return messages  # Final attempt failed; return the conversation so far
        delay = min(300, base_delay * (2 ** attempt))  # Exponential backoff, capped at 5 minutes
        jitter = random.uniform(0, 0.1 * delay)  # Up to 10% jitter
        time.sleep(delay + jitter)
Improved Error Handling
Specific handling for overloaded vs other internal server errors
Better error attribute handling using getattr()
Cleaner error reporting structure
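As a rough sketch of the getattr()-based reporting, something like the following could be used; the helper name and the attribute names read off the SDK error object are assumptions for illustration, not confirmed API:

def describe_api_error(e: Exception) -> dict:
    # Read optional attributes defensively; anything missing falls back to a default
    return {
        "type": type(e).__name__,
        "status_code": getattr(e, "status_code", None),
        "message": getattr(e, "message", str(e)),
        "request_id": getattr(e, "request_id", None),
    }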
Benefits
Improved Reliability
Higher success rate during high load periods
Automatic recovery from temporary overload conditions
Reduced impact on the API service through smart backoff
Better User Experience
Less noisy error reporting
More predictable behavior
Transparent retry process
System Health
Reduced load on API during stress periods
Better alignment with best practices for API consumption
More maintainable error handling code
Questions for Discussion
Are the retry parameters (max retries, delays) appropriate?
Should we add logging for retry attempts?
Should we consider implementing a circuit breaker pattern for sustained outages? (A rough sketch of what this could look like follows this list.)
Do we need to handle other error codes similarly?
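To make the circuit breaker question concrete, here is a rough sketch of the kind of mechanism it refers to; the class name, failure threshold, and cooldown are purely illustrative and not part of this proposal:

import time

class CircuitBreaker:
    """Stops issuing requests after repeated failures, then allows a trial after a cooldown."""

    def __init__(self, failure_threshold: int = 10, cooldown: float = 600.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: permit one trial request after the cooldown
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None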
Implementation Details
The implementation requires:
Adding imports: random, time
Modifying the sampling loop to include retry logic
Updating error handling structure
Adding appropriate type hints and documentation
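As an illustration of the last three items, the retry logic could be factored into a typed helper roughly like the one below; the function name, the logging call, and the choice to re-raise on the final attempt are assumptions for this sketch rather than the final design:

import logging
import random
import time
from typing import Callable, TypeVar

from anthropic import InternalServerError

T = TypeVar("T")

def call_with_overload_retry(fn: Callable[[], T], max_retries: int = 5, base_delay: float = 1.0) -> T:
    """Call fn, retrying overloaded_error responses with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except InternalServerError as e:
            if "overloaded_error" not in str(e) or attempt == max_retries - 1:
                raise
            delay = min(300.0, base_delay * (2 ** attempt))
            sleep_time = delay + random.uniform(0, 0.1 * delay)
            logging.warning("Overloaded (attempt %d/%d); retrying in %.1fs", attempt + 1, max_retries, sleep_time)
            time.sleep(sleep_time)
    raise AssertionError("unreachable: the final attempt either returns or re-raises")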
Alternatives Considered
Client-side rate limiting: Rejected as it doesn't handle dynamic server conditions
Fixed retry delay: Rejected as it doesn't scale well with varying load
Infinite retries: Rejected to prevent hanging in case of sustained issues
Testing Strategy
Unit tests for backoff calculation
Integration tests with mocked 529 responses
Load testing to verify behavior under stress
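For example, the backoff unit tests could look roughly like this, assuming the illustrative backoff_delay helper sketched under Exponential Backoff above:

def test_backoff_delay_follows_schedule():
    # Attempts 0-4 should produce 1s, 2s, 4s, 8s, 16s plus at most 10% jitter
    for attempt, expected in enumerate([1, 2, 4, 8, 16]):
        delay = backoff_delay(attempt)
        assert expected <= delay <= expected * 1.1

def test_backoff_delay_is_capped():
    # Even an absurdly high attempt count never exceeds the 5 minute cap (plus jitter)
    assert backoff_delay(20) <= 300 * 1.1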
Migration Plan
This change is backward compatible and can be rolled out directly as it only affects error handling behavior.