Byron / google-apis-rs

A binding and CLI generator for all Google APIs
http://byron.github.io/google-apis-rs

Process stuck, frozen on `CalendarHub.events().doit()` with high CPU usage #402

Closed · fnune closed this issue 1 year ago

fnune commented 1 year ago

Hi,

To start off, thank you very much for your libraries!

I'm running into an issue. I have a program that fetches calendar events periodically. It runs just fine for a while, but at some point (after roughly 15 minutes) it always ends up freezing the process here:

block_on(
    self.hub
        .events()
        .list("primary")
        .add_scope(Scope::EventReadonly)
        .max_results(10)
        .time_min(&time_min.to_rfc3339())
        .time_max(&time_max.to_rfc3339())
        .single_events(true)
        .order_by("starttime")
        .doit(), // here!
)
.map(|(response, events)| {
    // this scope is never reached
})
.map_err(|_| EventsProviderError {})

These are the last log lines I get with RUST_LOG=debug before the process gets stuck:

[2022-12-21T21:56:26Z DEBUG yup_oauth2::authenticator] access token requested for scopes: [https://www.googleapis.com/auth/calendar.events.readonly]
[2022-12-21T21:56:26Z DEBUG yup_oauth2::authenticator] found valid token in cache: TokenInfo { access_token: "redacted", refresh_token: Some("redacted"), expires_at: Some(2022-12-21 22:28:22.147643846 +00:00:00), id_token: None }

The process hangs indefinitely (I've left it running for hours) and CPU usage spikes. Every time this happens, CPU usage goes up to about 12% and stays there; 12% is roughly 100/8 (I have 8 CPUs), which suggests one core is fully busy.

I've tried wrapping the future in a tokio::time::timeout_at, but to my great surprise the timeout never goes off.
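
For reference, the wrapping attempt looked roughly like this (a sketch with a placeholder future standing in for the doit() call, not the exact code from my program; it assumes tokio with the "full" feature set):

use std::time::Duration;
use tokio::time::{timeout_at, Instant};

#[tokio::main]
async fn main() {
    // Stand-in for `self.hub.events().list("primary")...doit()`.
    let slow_request = async {
        tokio::time::sleep(Duration::from_secs(60)).await;
        "events"
    };

    // Under a normally driven Tokio runtime this returns Err(Elapsed) after
    // 10 seconds; in my program the wrapped call simply never returned.
    let deadline = Instant::now() + Duration::from_secs(10);
    match timeout_at(deadline, slow_request).await {
        Ok(events) => println!("got {events}"),
        Err(elapsed) => println!("timed out: {elapsed}"),
    }
}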

I ran strace with my program, and the last syscall is futex:

futex(0x55a7020ddf00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55a70208a118, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(9, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}], 3, 936) = 0 (Timeout)
statx(AT_FDCWD, "/etc/localtime", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFLNK|0777, stx_size=33, ...}) = 0
futex(0x7f52c0d955e0, FUTEX_WAKE_PRIVATE, 1) = 1

All of these strace lines also appear during normal operation, though.

As a last resort, I tried to get a core dump from the running process, but the process ignored all signals except SIGKILL, which suggests it's waiting on IO:

[screenshot]

I wonder whether you have any ideas as to what might be going on. Maybe I need to set a timeout somewhere? Or maybe this library needs to do that?

Thank you in advance!

fnune commented 1 year ago

I noticed that when this happens, the process goes from interruptible sleep to running (Sl+ to Rl+):

fausto    408829  0.5  0.1 967872 37708 pts/2    Sl+  23:49   0:01 ./target/debug/joinable
 ~/Development/joinable => ps aux | grep 408829 | grep -v grep
fausto    408829  0.5  0.1 967872 37708 pts/2    Sl+  23:49   0:01 ./target/debug/joinable
 ~/Development/joinable => ps aux | grep 408829 | grep -v grep
fausto    408829  4.1  0.1 967872 37708 pts/2    Rl+  23:49   0:15 ./target/debug/joinable

So maybe it's not waiting for IO anymore.

fnune commented 1 year ago

I got a backtrace from gdb. It looks like the process is stuck looping between these two locations:

futures-executor-0.3.25/src/local_pool.rs:95
std/src/thread/local.rs:446

Full trace:

(gdb) backtrace

#0 futures_executor::local_pool::run_executor::{closure#0}<core::result::Result<(http::response::Response<hyper::body::body::Body>,
google_calendar3::api::Events), google_calendar3::client::Error>,
futures_executor::local_pool::block_on::{closure_env#0}<core::future::from_generator::GenFuture<google_calendar3::api::{impl#133}::doit::{async_fn_env#0}<hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector<hyper::client::connect::dns::GaiResolver>>>>>>
(thread_notify=0x7f1e0b9afc30) at
/home/fausto/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.3.25/src/local_pool.rs:95

#1  0x000055b4bc25bbb1 in
std::thread::local::LocalKey<alloc::sync::Arc<futures_executor::local_pool::ThreadNotify>>::try_with<alloc::sync::Arc<futures_executor::local_pool::ThreadNotify>,
futures_executor::local_pool::run_executor::{closure_env#0}<core::result::Result<(http::response::Response<hyper::body::body::Body>,
google_calendar3::api::Events), google_calendar3::client::Error>,
futures_executor::local_pool::block_on::{closure_env#0}<core::future::from_generator::GenFuture<google_calendar3::api::{impl#133}::doit::{async_fn_env#0}<hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector<hyper::client::connect::dns::GaiResolver>>>>>>,
core::result::Result<(http::response::Response<hyper::body::body::Body>,
google_calendar3::api::Events), google_calendar3::client::Error>>
(self=0x55b4bcf505a8, f=...) at
/rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/thread/local.rs:446

#2  0x000055b4bc25b3aa in
std::thread::local::LocalKey<alloc::sync::Arc<futures_executor::local_pool::ThreadNotify>>::with<alloc::sync::Arc<futures_executor::local_pool::ThreadNotify>,
futures_executor::local_pool::run_executor::{closure_env#0}<core::result::Result<(http::response::Response<hyper::body::body::Body>,
google_calendar3::api::Events), google_calendar3::client::Error>,
futures_executor::local_pool::block_on::{closure_env#0}<core::future::from_generator::GenFuture<google_calendar3::api::{impl#133}::doit::{async_fn_env#0}<hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector<hyper::client::connect::dns::GaiResolver>>>>>>,
core::result::Result<(http::response::Response<hyper::body::body::Body>,
google_calendar3::api::Events), google_calendar3::client::Error>>
(self=0x55b4bcf505a8, f=...) thread/local.rs:422

#3  0x000055b4bc126bbf in
futures_executor::local_pool::run_executor<core::result::Result<(http::response::Response<hyper::body::body::Body>,
google_calendar3::api::Events), google_calendar3::client::Error>,
futures_executor::local_pool::block_on::{closure_env#0}<core::future::from_generator::GenFuture<google_calendar3::api::{impl#133}::doit::{async_fn_env#0}<hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector<hyper::client::connect::dns::GaiResolver>>>>>>
(f=...) at
/home/fausto/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.3.25/src/local_pool.rs:86

#4  0x000055b4bc1270ce in
futures_executor::local_pool::block_on<core::future::from_generator::GenFuture<google_calendar3::api::{impl#133}::doit::{async_fn_env#0}<hyper_rustls::connector::HttpsConnector<hyper::client::connect::http::HttpConnector<hyper::client::connect::dns::GaiResolver>>>>>
(f=...) at
/home/fausto/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-executor-0.3.25/src/local_pool.rs:317

#5  0x000055b4bc091ccf in
joinable::events::providers::google_calendar::{impl#2}::fetch_events
(self=0x7f1df8001478) at src/events/providers/google_calendar.rs:87

Byron commented 1 year ago

Thanks for posting and researching the issue this thoroughly!

It's unclear to me which version of the crate is being run here, and the only suggestion I can make is to build it from the source in this repository, so you can alter the code and use the very latest of it. Maybe try cargo update to fetch the latest versions of all dependencies, or cargo upgrades to see what's out of date. Maybe that fixes the issue.

Since it stops working after about 15 minutes, it seems linked to the validity of the token. Maybe it tries to refresh it and something goes wrong?

fnune commented 1 year ago

Since it stops working after about 15 minutes, it seems linked to the validity of the token.

I thought so too, but the app always freezes before expires_at. It's not exactly 15 minutes; that was just a gut-feeling figure (sorry about that).

fnune commented 1 year ago

Here's the snippet from Cargo.lock that I'm running after updating:

[[package]]
name = "google-calendar3"
version = "4.0.1+20220217"

The problem still reproduces.

fnune commented 1 year ago

I ran cargo vendor and am going to try hacking at the google-calendar3 dependency.

fnune commented 1 year ago

Looks like the response just never comes back and the program doesn't go further than the await here:

[screenshot of the code around the await]

That future is then polled over and over in a tight loop, which explains the high CPU usage.

I couldn't find that line in the generated code, btw.

fnune commented 1 year ago

Something I only noticed just now: after a while, this shows up in the output:

[2022-12-22T19:31:04Z DEBUG rustls::conn] Sending warning alert CloseNotify

fnune commented 1 year ago

I tried adding a timeout and it's not timing out:

timeout(Duration::from_secs(10), client.request(request.unwrap())).await.unwrap()

Byron commented 1 year ago

Something seems broken somewhere in the reqwest stack, and that gives me quite a strange feeling. After all, I don't understand how this stack works, and when it bites like it does now, I'm quite lost.

One more thing I'd try is tokio-console, to see more clearly which request is failing. Maybe that reveals something new that leads to a solution. If not, I'm unfortunately out of ideas here as well.

fnune commented 1 year ago

Hello again and happy new year :smile: although it's a little late already. I'm back at it, trying to figure out what's wrong.

I used lsof to try to find the IP/port where the app is making a connection:

$ lsof -i -a -p $(pidof joinable)
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
joinable 9449 fausto   16u  IPv4  66942      0t0  TCP feanor:55974->ham11s07-in-f10.1e100.net:https (ESTABLISHED)

Then I found packets going through port 55974 in Wireshark and selected the corresponding TCP stream:

[Wireshark screenshot of the captured TCP stream]

My program was polling every 10 seconds for testing, so (1) and (2) show successful requests and all their packets.

Section (3) shows the closing of the TCP connection when I realized the program got stuck and no further packets were received.

So the finding is: the Rust program is stuck before the response packets arrive.

fnune commented 1 year ago

Time for tokio-console...

fnune commented 1 year ago

I'm half hoping/fearing that it's something very silly :(

fnune commented 1 year ago

In the resources view I see this:

[tokio-console resources view screenshot]

│   110 n/a     Sync   198.8622s tokio::sync::oneshot         Sender|Receiver ✅   <cargo>/hyper-0.14.23/src/client/dispatch.rs:125:24      rx_dropped=false tx_dropped=false value_received=false value_sent=true              │

The resource is this one-shot channel here:

https://github.com/hyperium/hyper/blob/0.14.x/src/client/dispatch.rs#L124-L130
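
For context on those flags: value_sent=true with value_received=false just means something was put into the channel but never taken out by whoever holds the receiver. A minimal sketch of that state with a plain tokio::sync::oneshot (my own illustration, not hyper's internal code):

use std::time::Duration;
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel::<&str>();

    // The sending half completes: value_sent=true.
    tx.send("response").unwrap();

    // The receiving half stays alive but is never awaited, so the value is
    // never taken out of the channel: value_received=false, like the stuck
    // dispatch channel in the console output above.
    let _stuck_rx = rx;
    tokio::time::sleep(Duration::from_secs(1)).await;
}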

The number of polls in the tasks view of tokio-console is no longer increasing (although that could be because the program is stuck, and is no longer emitting traces).

fnune commented 1 year ago

The task that's stuck was created here, but that seems pretty vanilla... https://github.com/hyperium/hyper/blob/0.14.x/src/common/exec.rs#L49

fnune commented 1 year ago

Ohh, finally something interesting... it looks like enabling the tracing feature of Tokio made my logs show some more detail.

This is a healthy fetch:

2023-02-03T23:30:59.593351Z DEBUG hyper::client::pool: reuse idle connection for ("https", www.googleapis.com)
2023-02-03T23:30:59.594418Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(217), flags: (0x5: END_HEADERS | END_STREAM) }
2023-02-03T23:30:59.726889Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(217), flags: (0x4: END_HEADERS) }
2023-02-03T23:30:59.727045Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(217) }
2023-02-03T23:30:59.727125Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(217), flags: (0x1: END_STREAM) }
2023-02-03T23:30:59.727204Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 216] }
2023-02-03T23:30:59.727249Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 216] }

This is the fetch where the process gets stuck:

2023-02-03T23:31:09.591212Z DEBUG hyper::client::pool: reuse idle connection for ("https", www.googleapis.com)
2023-02-03T23:31:09.592015Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(219), flags: (0x5: END_HEADERS | END_STREAM) }
2023-02-03T23:31:09.718778Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(219), flags: (0x4: END_HEADERS) }
2023-02-03T23:31:09.718854Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(219) }
2023-02-03T23:31:09.719094Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(219), flags: (0x1: END_STREAM) }
2023-02-03T23:31:09.719142Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 218] }
2023-02-03T23:31:09.719163Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 218] }
2023-02-03T23:35:09.720770Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(2147483647), debug_data: b"session_timed_out" }
2023-02-03T23:35:09.720832Z DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 220] }
2023-02-03T23:35:09.720853Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 220] }
2023-02-03T23:35:09.720941Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }
2023-02-03T23:35:09.720962Z DEBUG Connection{peer=Client}: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
[2023-02-03T23:35:09Z DEBUG rustls::conn] Sending warning alert CloseNotify

Particularly this:

[screenshot]

fnune commented 1 year ago

Ah, those last messages come in several minutes after the program gets stuck... :/ actually exactly 4 minutes. I tested that again.

fnune commented 1 year ago

Tomorrow I'll gather some more energy and see about posting an issue over at hyper. This looks to me like something that should be handled at a lower level? The program can't just take up an entire CPU core indefinitely...

Byron commented 1 year ago

Thanks for sharing your extensive notes, and good luck with figuring out the culprit! It's not the first time I've noticed strange hangs, which makes me feel a bit uneasy about the state of async in general.

That aside, maybe you could try the latest version, published 10 days ago; maybe that changes something.

fnune commented 1 year ago

Upstreamed all my notes to https://github.com/hyperium/hyper/issues/3140

Byron commented 1 year ago

Thanks a lot! Maybe GoAway is indeed something new for hyper to learn; let's hope there is a solution for that.

For now I assume the issue is indeed somewhere in hyper and will therefore close this issue; please feel free to reopen once you know that a fix should be applied here. Thank you!

fnune commented 1 year ago

Sure thing, thanks for bearing with me.

fnune commented 1 year ago

https://github.com/hyperium/hyper/issues/3140#issuecomment-1419583828

futures-executor-0.3.25/src/local_pool.rs:95

This is a red flag to me. I would check where you are calling the local pool (is that the block_on you mention at the top?), and rip it out. It doesn't work well with Tokio's reactor. Replace it with either a Tokio runtime block_on call, or better, make the function async itself and just use #[tokio::main].

The reason they conflict: Tokio's reactor uses epoll_wait when it needs to wait on IO. When epoll returns, Tokio's reactor processes all the events it noticed. futures-executor's block_on does something a little different: when the future is not ready, it parks the thread (std::thread::park()) until the future wakes it up. The problem is that the reactor is on that same thread, and since the thread is parked, it stalls there and never reaches the epoll_wait. Using futures-executor is a very common cause of deadlocks.
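
A minimal reduction of that deadlock (my own sketch, not code from either issue): on a current-thread Tokio runtime, futures_executor::block_on parks the only thread, so the runtime's timer/IO driver never gets a chance to run and the inner future can never complete:

// Assumes tokio = { version = "1", features = ["full"] } and futures = "0.3".
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // block_on parks this thread until the inner future is ready...
    futures::executor::block_on(async {
        // ...but the timer that would wake it is driven by the Tokio runtime
        // on this same, now-parked thread, so this await never completes.
        tokio::time::sleep(Duration::from_secs(1)).await;
    });
    println!("never reached");
}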

Problem solved. I removed futures::executor::block_on and used just async/await on top of #[tokio::main]. I learned quite a bit from this!
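
For anyone landing here later, this is the shape of the fix, sketched with a placeholder request future standing in for the hub call (the real change was in my fetch_events):

// Stand-in for `hub.events().list("primary")...doit().await`.
async fn doit() -> Result<&'static str, ()> {
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    Ok("events")
}

// Before (hangs): a synchronous function driving the request with
// futures::executor::block_on from inside a Tokio program.
// After: the function is async itself and simply awaits the request,
// leaving the Tokio reactor in charge of the thread.
async fn fetch_events() -> Result<&'static str, ()> {
    doit().await
}

#[tokio::main]
async fn main() {
    match fetch_events().await {
        Ok(events) => println!("fetched: {events}"),
        Err(()) => eprintln!("fetch failed"),
    }
}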

Byron commented 1 year ago

Thank you, very interesting. Lesson learned: use tokio tooling with tokio only, and don't expect all crates to interact nicely with each other.

fnune commented 1 year ago

I'd just read through Programming Rust, and they say the interoperability of these libraries has been improving a lot, so I tried to YOLO my way through this. I guess I shouldn't have!