ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Register Worker call fails on Kubernetes #641

Open chokosabe opened 1 month ago

chokosabe commented 1 month ago

I have had an instance running on Kubernetes for ~10 days. It has suddenly started producing errors:

panicked at /app/crates/arroyo-worker/src/lib.rs:297:14: called Result::unwrap() on an Err value: Status { code: FailedPrecondition, message: "Cannot handle message for job_vR6Gen2XNs: State machine is inactive", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Sat, 01 Jun 2024 19:10:43 GMT", "content-length": "0"} }, source: None } panic.file="/app/crates/arroyo-worker/src/lib.rs" panic.line=297 panic.column=14

The panic location seems to point to this line:

https://github.com/ArroyoSystems/arroyo/blob/faa29a546bdb1bbc300ac2f9731c0dcc02b77bbe/crates/arroyo-worker/src/lib.rs#L297

The issue persisted after an updated deploy, so it could well be the environment that's at fault, or something retained in the namespace.
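To illustrate why the worker panics rather than logging and exiting cleanly: the log shows `Result::unwrap()` being called on a gRPC error whose code is `FailedPrecondition`. Below is a minimal sketch of matching on that result instead of unwrapping it. The `Status`, `Code`, and `Outcome` types and the `handle_register_result` function are hypothetical stand-ins written for this example, not Arroyo's or tonic's actual API, and whether a worker should shut down on `FailedPrecondition` is an assumption, not the project's confirmed behavior.

```rust
use std::fmt;

// Hypothetical stand-ins for the gRPC status types (NOT Arroyo's or tonic's real API).
#[derive(Debug, Clone, PartialEq)]
enum Code {
    FailedPrecondition,
    Unavailable,
}

#[derive(Debug, Clone, PartialEq)]
struct Status {
    code: Code,
    message: String,
}

impl fmt::Display for Status {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}: {}", self.code, self.message)
    }
}

// What the worker decides to do after the register call returns.
#[derive(Debug, PartialEq)]
enum Outcome {
    Registered,
    ShutDown(String),
    Retry(String),
}

// Match on the result instead of calling `.unwrap()`, so a rejected
// registration becomes a readable message rather than a panic.
fn handle_register_result(job_id: &str, result: Result<(), Status>) -> Outcome {
    match result {
        Ok(()) => Outcome::Registered,
        // Assumption: the controller no longer tracks this job, so retrying is pointless.
        Err(s) if s.code == Code::FailedPrecondition => Outcome::ShutDown(format!(
            "controller rejected registration for {job_id}: {s}"
        )),
        // Assumption: any other status is treated as transient.
        Err(s) => Outcome::Retry(format!(
            "transient error registering for {job_id}: {s}"
        )),
    }
}

fn main() {
    let rejected = Status {
        code: Code::FailedPrecondition,
        message: "Cannot handle message for job_vR6Gen2XNs: State machine is inactive".to_string(),
    };
    println!("{:?}", handle_register_result("job_vR6Gen2XNs", Err(rejected)));

    let flaky = Status {
        code: Code::Unavailable,
        message: "connection reset".to_string(),
    };
    println!("{:?}", handle_register_result("job_vR6Gen2XNs", Err(flaky)));
}
```

With handling along these lines, a stale worker left over in the namespace would report that its job is no longer active instead of crash-looping on a panic.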

chokosabe commented 1 month ago

I forgot to add that the frontend shows this error:

"failed to tear down existing cluster"

This happens after the check, which passes fine. The error appears when previewing and/or running the pipeline.

chokosabe commented 1 month ago

This was resolved by clearing out the artifacts and checkpoints on AWS and also by deleting all the ReplicaSets for the existing workers. I don't know which of these fixed things. It would be great if the error message pointed out exactly what the app was trying to do when it errored, i.e. which pods or ReplicaSets it was trying to delete when the error was triggered.
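As a rough illustration of the kind of error message requested above, here is a minimal sketch of a teardown error that records which resource was being deleted when the failure occurred. Everything in it is hypothetical: `TeardownError`, `tear_down`, the `delete` callback, and the resource names are made up for this example and do not reflect how Arroyo's Kubernetes scheduler is actually implemented.

```rust
use std::fmt;

// Hypothetical error type that keeps the context of the failed operation.
#[derive(Debug)]
struct TeardownError {
    job_id: String,
    resource_kind: &'static str,
    resource_name: String,
    cause: String,
}

impl fmt::Display for TeardownError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "failed to tear down existing cluster for {}: could not delete {} '{}': {}",
            self.job_id, self.resource_kind, self.resource_name, self.cause
        )
    }
}

impl std::error::Error for TeardownError {}

// Deletes each ReplicaSet in turn via the supplied callback and stops at the
// first failure, attaching the name of the resource that could not be deleted.
fn tear_down<F>(job_id: &str, replica_sets: &[&str], mut delete: F) -> Result<(), TeardownError>
where
    F: FnMut(&str) -> Result<(), String>,
{
    for &name in replica_sets {
        delete(name).map_err(|cause| TeardownError {
            job_id: job_id.to_string(),
            resource_kind: "ReplicaSet",
            resource_name: name.to_string(),
            cause,
        })?;
    }
    Ok(())
}

fn main() {
    // Simulated delete call: the second (made-up) ReplicaSet cannot be removed.
    let result = tear_down("job_vR6Gen2XNs", &["worker-rs-1", "worker-rs-2"], |name| {
        if name == "worker-rs-2" {
            Err("forbidden".to_string())
        } else {
            Ok(())
        }
    });
    if let Err(e) = result {
        eprintln!("{e}");
    }
}
```

A message of this shape would have made it clear whether the stale ReplicaSets or the stored checkpoints were the problem.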