ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Pipeline stuck in "Scheduling" #262

Closed juchiast closed 10 months ago

juchiast commented 10 months ago

Installed on EKS, Kubernetes 1.26, 4x t3.large instances, arroyo-0.5.

values.yaml:

outputDir: "/tmp/arroyo-test"

volumes:
  - name: checkpoints
    hostPath:
      path: /tmp/arroyo-test
      type: DirectoryOrCreate

volumeMounts:
  - name: checkpoints
    mountPath: /tmp/arroyo-test

Logs:

deployment.apps/arroyo-controller:

2023-08-21T09:49:59.016441Z  INFO arroyo_controller::states: starting state machine job_id="job_wHHzn9Ezp7"
2023-08-21T09:49:59.017458Z  INFO arroyo_controller::states: state transition job_id="job_wHHzn9Ezp7" from="Created" to="Compiling" duration_ms=0
2023-08-21T09:49:59.048036Z  INFO arroyo_controller::states::compiling: Compiling pipeline job_id="job_wHHzn9Ezp7" hash="kl7rbew88rogxeqk"
2023-08-21T09:49:59.048262Z  INFO arroyo_controller::compiler: Compiling remotely on http://arroyo-compiler:9000
2023-08-21T09:49:59.048363Z  INFO arroyo_controller::compiler: digraph {
    0 [ label = "finnhub_0:WebsocketSource<wss://ws.finnhub.io/?token=...>" ]
    1 [ label = "watermark_1:Watermark" ]
    2 [ label = "sink_web_2:WebSink" ]
    3 [ label = "fused_3:expression<sql_fused<value_project,value_project>:Record>" ]
    0 -> 1 [ label = "() → finnhub :: ArroyoJsonRoot" ]
    3 -> 2 [ label = "() → generated_struct_5412302300650363671" ]
    1 -> 3 [ label = "() → finnhub :: ArroyoJsonRoot" ]
}

2023-08-21T09:50:44.708458Z  INFO arroyo_controller::states: state transition job_id="job_wHHzn9Ezp7" from="Compiling" to="Scheduling" duration_ms=45690
2023-08-21T09:58:22.867727Z  INFO arroyo_controller::states: starting state machine job_id="job_37XrqB8Fd4"
2023-08-21T09:58:22.868857Z  INFO arroyo_controller::states: state transition job_id="job_37XrqB8Fd4" from="Created" to="Compiling" duration_ms=0
2023-08-21T09:58:22.891353Z  INFO arroyo_controller::states::compiling: Compiling pipeline job_id="job_37XrqB8Fd4" hash="yolstlaxxnxwf88j"
2023-08-21T09:58:22.891399Z  INFO arroyo_controller::compiler: Compiling remotely on http://arroyo-compiler:9000
2023-08-21T09:58:22.891520Z  INFO arroyo_controller::compiler: digraph {
    0 [ label = "finnhub_0:WebsocketSource<wss://ws.finnhub.io/?token=...>" ]
    1 [ label = "watermark_1:Watermark" ]
    2 [ label = "sink_web_2:WebSink" ]
    3 [ label = "fused_3:expression<sql_fused<value_project,value_project>:Record>" ]
    0 -> 1 [ label = "() → finnhub :: ArroyoJsonRoot" ]
    3 -> 2 [ label = "() → generated_struct_5412302300650363671" ]
    1 -> 3 [ label = "() → finnhub :: ArroyoJsonRoot" ]
}

2023-08-21T09:58:37.024947Z  INFO arroyo_controller::states: state transition job_id="job_37XrqB8Fd4" from="Compiling" to="Scheduling" duration_ms=14156
2023-08-21T10:00:44.750722Z ERROR arroyo_controller::states: retryable state error job_id="job_wHHzn9Ezp7" state="Scheduling" error_message="timed out while waiting for job to start" error="timed out after 600s while waiting for worker startup" retries=3
2023-08-21T10:08:37.068349Z ERROR arroyo_controller::states: retryable state error job_id="job_37XrqB8Fd4" state="Scheduling" error_message="timed out while waiting for job to start" error="timed out after 600s while waiting for worker startup" retries=3

juchiast commented 10 months ago

replicaset.apps/arroyo-worker-job-zgfuyex3ab-1:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:16:73
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:16:73
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(1), ...)', src/main.rs:66:21

mwylde commented 10 months ago

Hi @juchiast thanks for checking out Arroyo!

In Kubernetes, we use a very small worker image (built by this Dockerfile) that is responsible for downloading the pipeline binary from S3 or the filesystem and starting it. That (not very clear) error message means the worker failed to find the binary.
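
To illustrate why the panic above is hard to read, here is a minimal sketch of a launcher that reports a missing binary explicitly instead of calling `unwrap()`. It is not Arroyo's actual worker code: `LaunchError`, `launch_pipeline`, and the path `/tmp/arroyo-test/pipeline` are hypothetical names chosen for this example.

```rust
use std::fmt;
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, ExitStatus};

/// Hypothetical error type for this sketch; not part of Arroyo.
#[derive(Debug)]
enum LaunchError {
    /// The binary does not exist at the expected path.
    BinaryMissing(PathBuf),
    /// Any other I/O failure while starting the process.
    Io(PathBuf, io::Error),
}

impl fmt::Display for LaunchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LaunchError::BinaryMissing(path) => write!(
                f,
                "pipeline binary not found at {}; is the artifact directory visible from this node?",
                path.display()
            ),
            LaunchError::Io(path, err) => {
                write!(f, "failed to start {}: {}", path.display(), err)
            }
        }
    }
}

/// Runs the binary and waits for it, mapping "No such file or directory"
/// to a descriptive error instead of panicking.
fn launch_pipeline(binary: &Path) -> Result<ExitStatus, LaunchError> {
    match Command::new(binary).status() {
        Ok(status) => Ok(status),
        Err(err) if err.kind() == io::ErrorKind::NotFound => {
            Err(LaunchError::BinaryMissing(binary.to_path_buf()))
        }
        Err(err) => Err(LaunchError::Io(binary.to_path_buf(), err)),
    }
}

fn main() {
    // Hypothetical binary path, reusing the directory from the values.yaml above.
    let binary = Path::new("/tmp/arroyo-test/pipeline");
    match launch_pipeline(binary) {
        Ok(status) => println!("pipeline exited with {status}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```

With this shape, a worker scheduled on a node that never received the binary would log the missing path rather than a bare `Os { code: 2, kind: NotFound, ... }`.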

I believe the issue here is with your helm configuration. You've configured everything to use local paths (/tmp/arroyo-test), which will work when running everything on a single node (like in minikube) but not on a distributed cluster, where each node has its own local filesystem.

Instead, you will need to configure an S3 bucket to store the pipeline artifacts, as in https://doc.arroyo.dev/deployment/kubernetes#example-eks-configuration.
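
The rule described above (local paths are fine on one node, shared storage is needed on several) can be sketched as a small preflight check. This is only an illustration, not something Arroyo is known to ship: `Storage`, `classify`, and `check_output_dir` are hypothetical, the bucket name is made up, and treating an `s3://` prefix as the marker for shared storage is an assumption.

```rust
/// Where a configured output location lives (hypothetical classification).
#[derive(Debug, PartialEq)]
enum Storage {
    /// Reachable from every node, e.g. an S3 bucket.
    Shared,
    /// A path on one node's filesystem, e.g. a hostPath volume.
    NodeLocal,
}

/// Assumption: anything with an `s3://` prefix is shared, everything else is node-local.
fn classify(output_dir: &str) -> Storage {
    if output_dir.starts_with("s3://") {
        Storage::Shared
    } else {
        Storage::NodeLocal
    }
}

/// Rejects node-local output directories when the cluster has more than one node.
fn check_output_dir(output_dir: &str, node_count: usize) -> Result<(), String> {
    match classify(output_dir) {
        Storage::NodeLocal if node_count > 1 => Err(format!(
            "{output_dir} is node-local, but the cluster has {node_count} nodes; \
             workers on other nodes will not find the pipeline binary"
        )),
        _ => Ok(()),
    }
}

fn main() {
    // The configuration from this issue: a local path on a 4-node EKS cluster.
    println!("{:?}", check_output_dir("/tmp/arroyo-test", 4));
    // The same path on a single node (like minikube) passes the check.
    println!("{:?}", check_output_dir("/tmp/arroyo-test", 1));
    // A hypothetical bucket name; shared storage passes for any node count.
    println!("{:?}", check_output_dir("s3://my-arroyo-artifacts", 4));
}
```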

Happy to help you synchronously on Discord as well if that's easier!

juchiast commented 10 months ago

Thanks! I suggest adding a note to the docs so that users know a local config won't work when running on EKS.

Btw, I had to manually set the K8S_WORKER_SERVICE_ACCOUNT_NAME env var on arroyo-controller to make it work with S3.