algesten / ureq

A simple, safe HTTP client
Apache License 2.0

OS Error 11 (EAGAIN) from inside TLS #766

Closed: John-Nagle closed this issue 2 weeks ago

John-Nagle commented 1 month ago

An HTTPS GET failed with this error:

Caused by:
    https://simhost-0c7654194ce7142bc.agni.secondlife.io:12043/cap/fa9fd2b1-f800-f0b8-8118-e71d27079306:
    Connection Failed: tls connection init failed: Resource temporarily unavailable (os error 11)

That's an EAGAIN error, which is strange to see from ureq. EAGAIN is synonymous with EWOULDBLOCK, which nonblocking (async) I/O uses to report that there's no data available right now, so try again later.

But ureq is synchronous; it shouldn't return that error. A timeout might legitimately be returned, but this error status is wrong.

It's non-fatal, so there's no backtrace.

This is an intermittent error. The code has been making these requests successfully for months.

My code treats this as a non-retryable error. It should probably retry. But how did this make it out to the user?
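
One way to classify this on the caller's side would be to walk the error's source chain looking for an io::Error with kind WouldBlock. A rough sketch; the helper name is hypothetical, and it assumes the underlying io::Error survives in the error chain rather than being flattened into a string:

use std::error::Error;
use std::io;

// Hypothetical helper: true if any error in the source chain is an
// io::Error with kind WouldBlock (EAGAIN/EWOULDBLOCK on Linux).
fn is_would_block(err: &(dyn Error + 'static)) -> bool {
    let mut cur: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = cur {
        if let Some(ioe) = e.downcast_ref::<io::Error>() {
            if ioe.kind() == io::ErrorKind::WouldBlock {
                return true;
            }
        }
        cur = e.source();
    }
    false
}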

ureq = "2.9.1" Linux 22.04 LTS

algesten commented 1 month ago

Hi @John-Nagle!

ureq does use set_nonblocking() temporarily when it wants to check that a pooled connection to a server is still fresh. The way it does it is:

  1. set_nonblocking(true)
  2. Try read.
  3. set_nonblocking(false)

The idea is that on the happy path, the read results in an EWOULDBLOCK, indicating the connection is still open and can be used. A closed connection would give us a different error, and if we manage to read any bytes, the server sent something we didn't expect (in which case it's best to close the connection, since otherwise we risk request smuggling and the like).
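
A minimal sketch of that pattern, using a plain TcpStream for illustration (this is not ureq's actual code):

use std::io::{ErrorKind, Read};
use std::net::TcpStream;

// Freshness check for a pooled connection, as described above.
fn connection_is_fresh(stream: &mut TcpStream) -> std::io::Result<bool> {
    stream.set_nonblocking(true)?;
    let mut buf = [0u8; 1];
    let fresh = match stream.read(&mut buf) {
        // EWOULDBLOCK/EAGAIN: nothing to read, the connection is idle and reusable.
        Err(e) if e.kind() == ErrorKind::WouldBlock => true,
        // Bytes or EOF: unexpected data or a closed connection; don't reuse it.
        Ok(_) => false,
        // Any other error: the connection is unusable.
        Err(_) => false,
    };
    stream.set_nonblocking(false)?;
    Ok(fresh)
}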

I wonder if what you're observing has something to do with this?

John-Nagle commented 1 month ago

Possibly. It's fine for ureq to use nonblocking I/O internally, but EWOULDBLOCK should never make it out to the caller.

For now, I turned on a retry loop that, on any error, waits 1 second and tries the request again, up to 5 times. That will probably work around intermittent errors.

jsha commented 1 month ago

Could you share the code that produces this error and a general description of the traffic pattern? For instance, does the server close the connection after a certain period of time?

I'm curious about the "tls connection init failed" because in theory we should only be pooling connections that are already initialized and have had a request/response pair sent on them.
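
For reference, pooling in ureq 2.x happens inside an Agent, so connection reuse only comes into play when the same Agent makes repeated calls. A minimal sketch (the URL is a placeholder):

fn main() -> Result<(), ureq::Error> {
    // Reusing a single Agent keeps connections in its pool between requests.
    let agent = ureq::AgentBuilder::new().build();
    let _first = agent.get("https://example.com/").call()?; // opens a connection
    let _second = agent.get("https://example.com/").call()?; // may reuse the pooled one
    Ok(())
}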

John-Nagle commented 1 month ago

pub fn fetch_capabilities(
    client: &HttpClient,
    cap_url: &str,
    keys: &[&str],
) -> Result<HashMap<String, String>, Error> {
    const RETRIES: usize = 0; // number of retries after the first attempt

    //  Construct request in LLSD format
    let query = compose_query(keys).with_context(|| {
        format!(
            "Composing query for capabilities {:?} from {}",
            keys, cap_url
        )
    })?;
    //  Retry query. Especially on Open Simulator, tries can fail.
    //  Documentation says that firewall/CDN problems can cause transient 5xx errors.
    let mut try_count = 0;
    let reply = loop {
        log::info!("Capability query to {}: {}", cap_url, query);
        //  Send HTTP request and block.
        let http_result = client
            .agent
            .post(cap_url)
            ////.set("Content-Type", "text/xml") // "text/xml; charset=utf-8" is rejected by SL server
            .set("Content-Type", "application/llsd+xml") // "text/xml; charset=utf-8" is rejected by SL server
            .send_string(&query)
            .map_err(anyhow::Error::msg)
            .with_context(|| format!("Fetching capabilities {:?} from {}", keys, cap_url));
        //  Analyze result
        match http_result {
            Ok(reply) => break reply, // good
            Err(e) => {
                log::error!(
                    "Error reading capability {}, try {}: {:?}",
                    cap_url,
                    try_count,
                    e
                );
                try_count += 1;
                if try_count > RETRIES {
                    return Err(e);
                }
                std::thread::sleep(std::time::Duration::from_millis(1000)); // wait before retry
            }
        }
    };
   ...

This was running with RETRIES=0, so any failure was fatal. It's now running with RETRIES=5.

This is deep inside a metaverse client for Second Life/Open Simulator. The servers provide a URL which is queried for some internal data. This particular code is executed a few times per minute at most; as the player moves around, it's part of the system that attaches new regions, so it's not heavy traffic. The server side is 20-year-old code and may be misbehaving. The client code above has been unchanged for the last year.

algesten commented 2 weeks ago

Closing since we're moving to ureq 3.x. The behavior is changing.