cloudflare / pingora

A library for building fast, reliable and evolvable network services.
Apache License 2.0
20.28k stars 1.1k forks source link

Is the log level correct for ERROR 'failed to acquire reusable stream' #201

Closed xushichangdesmond closed 1 month ago

xushichangdesmond commented 2 months ago

Hi

See the error 'failed to acquire reusable stream' in pingora-core/src/connectors/mod.rs

    pub async fn reused_stream<P: Peer + Send + Sync>(&self, peer: &P) -> Option<Stream> {
        match self.connection_pool.get(&peer.reuse_hash()) {
            Some(s) => {
                debug!("find reusable stream, trying to acquire it");
                {
                    let _ = s.lock().await;
                } // wait for the idle poll to release it
                match Arc::try_unwrap(s) {
                    Ok(l) => {
                        let mut stream = l.into_inner();
                        // test_reusable_stream: we assume server would never actively send data
                        // first on an idle stream.
                        if peer.matches_fd(stream.id()) && test_reusable_stream(&mut stream) {
                            Some(stream)
                        } else {
                            None
                        }
                    }
                    Err(_) => {
                        error!("failed to acquire reusable stream");
                        None
                    }
                }
            }
            None => {
                debug!("No reusable connection found for {peer}");
                None
            }
        }
    }

If I understand correctly, this should be maybe logged on DEBUG level more appropriately instead of error, since it isn't really an error if the other end drops the connection every now and then if theres no inflight requests.

andrewhavck commented 2 months ago

Makes sense, feel free to create a PR.

xushichangdesmond commented 2 months ago

Sure, please check https://github.com/cloudflare/pingora/pull/205

andrewhavck commented 2 months ago

Sorry for the confusion, I misread above and that log is actually due to the Arc::try_unwrap as there is more than one strong reference to the Arc which shouldn't happen. This is technically a bug and that's why we log an error, it's worth investigating further though.

xushichangdesmond commented 2 months ago

ConnectionPool::idle_poll(...) owns the lock via a OwnedMutexGuard, which holds the same arc reference of topic here(whose try_unwrap is failing) via OwnedMutexGuard::lock

after idle_poll exits, that guard is dropped and the lock released, but it is not clear to me that OwnedMutexLock::lock is released before or in atomic fashion with the release of that lock as in reusedstream(...)'s ```let = s.lock().await;```

If this sounds possible, maybe we should try to reproduce this in a small test case just using the Arc<Mutex<()>> pattern together with an async use of the mutex's try_lock_owned() or otherwise

andrewhavck commented 2 months ago

Have you been able to reproduce this issue?

andrewhavck commented 1 month ago

Closing this issue as the log level seems appropriate and we're only seeing this log on machines that are in a bad state. If you're able to consistently repro please open a new issue, thanks.

Object905 commented 3 weeks ago

What are possible cause of this? I'm getting this consistently with these peer options

let mut result = PeerOptions::new();
result.tcp_keepalive = Some(TcpKeepalive {
    count: 3,
    idle: Duration::from_secs(60),
    interval: Duration::from_secs(20),
});
result.total_connection_timeout = Some(Duration::from_secs(10));
result.idle_timeout = Some(Duration::from_secs(60));
result.read_timeout = Some(Duration::from_secs(15));
result.write_timeout = Some(Duration::from_secs(15));
result.alpn = ALPN::H2H1;
result.max_h2_streams = 1;
result.h2_ping_interval = None;
result