apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

The LoadTests Go GBK Flink Batch job is flaky #30507

Open github-actions[bot] opened 9 months ago

github-actions[bot] commented 9 months ago

The LoadTests Go GBK Flink Batch job is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_LoadTests_Go_GBK_Flink_Batch.yml?query=is%3Afailure+branch%3Amaster to see the logs.

volatilemolotov commented 7 months ago

Tried increasing the timeout to almost 12h, but the job still times out: https://github.com/volatilemolotov/beam/actions/runs/8627370677/job/23647207749

github-actions[bot] commented 3 months ago

Reopening since the workflow is still flaky

liferoad commented 2 weeks ago

Caused by: java.io.IOException: Cannot run program "docker": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.apache.beam.runners.fnexecution.environment.DockerCommand.runShortCommand(DockerCommand.java:207)
    at org.apache.beam.runners.fnexecution.environment.DockerCommand.runShortCommand(DockerCommand.java:181)
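
The trace above shows the JVM failing to spawn `docker` (error=2 is "No such file or directory"), which suggests the `docker` binary is not on the PATH of the process that launches the SDK environment. A minimal sketch of a check for that condition follows. The helper names `have_cmd` and `check_docker` are hypothetical and not part of the Beam repository, and the check is only meaningful when run in the same environment as the process that reported the error.

```shell
# Hypothetical helpers for illustration; not part of Beam.

# Succeeds if the named program can be found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Reports whether the docker CLI is visible to the current shell.
check_docker() {
  if have_cmd docker; then
    echo "docker found at $(command -v docker)"
  else
    echo "docker not found on PATH" >&2
    return 1
  fi
}
```

Running `check_docker` on the worker that executes the job would show whether `--environment_type=DOCKER` can work there at all.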
liferoad commented 1 week ago

Tested this locally:

./gradlew :sdks:go:test:load:run -PloadTest.mainClass=group_by_key -Prunner=FlinkRunner -PloadTest.args='--influx_namespace=flink --influx_measurement=go_batch_gbk_1 --input_options="{\"num_records\":200000000,\"key_size\":1,\"value_size\":9}" --iterations=1 --fanout=1 --parallelism=5 --endpoint=localhost:8099 --environment_type=DOCKER --environment_config=gcr.io/apache-beam-testing/beam-sdk/beam_go_sdk:latest --runner=FlinkRunner'
2024/11/21 21:58:01 Failed to execute job:      connecting to job service
failed to dial server at localhost:8099
        caused by:
context deadline exceeded
panic: Failed to execute job:   connecting to job service
        failed to dial server at localhost:8099
                caused by:
        context deadline exceeded

goroutine 1 [running]:
github.com/apache/beam/sdks/v2/go/pkg/beam/log.Fatalf({0x234e280, 0x3bc7c60}, {0x21193ff?, 0x3bc7c60?}, {0xc00078ff28?, 0x0?, 0x0?})
        /usr/local/google/home/xqhu/Dev/beam/sdks/go/pkg/beam/log/log.go:162 +0x7d
main.main()
        /usr/local/google/home/xqhu/Dev/beam/sdks/go/test/load/group_by_key/group_by_key.go:98 +0x3c9

> Task :sdks:go:test:load:run FAILED

FAILURE: Build failed with an exception.
liferoad commented 1 week ago

Steps to run a local test:

  1. Start a local Flink cluster:
    wget https://downloads.apache.org/flink/flink-1.17.2/flink-1.17.2-bin-scala_2.12.tgz
    tar zxvf flink-1.17.2-bin-scala_2.12.tgz
    cd flink-1.17.2
    ./bin/start-cluster.sh
  2. Start the job server and point it at the Flink cluster:
    docker run --net=host gcr.io/apache-beam-testing/beam_portability/beam_flink1.17_job_server --flink-master=localhost:8081
  3. Run a Go load test against the job server endpoint:
    ./gradlew :sdks:go:test:load:run -PloadTest.mainClass=group_by_key -Prunner=FlinkRunner -PloadTest.args='--influx_namespace=flink --influx_measurement=go_batch_gbk_1 --input_options="{\"num_records\":200,\"key_size\":1,\"value_size\":9}" --iterations=1 --fanout=1 --parallelism=1 --endpoint=localhost:8099 --environment_type=DOCKER --environment_config=gcr.io/apache-beam-testing/beam-sdk/beam_go_sdk --runner=PortableRunner'
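
The earlier "failed to dial server at localhost:8099" error is what step 3 produces when nothing is listening on the job service endpoint yet. A small pre-flight check before step 3 can make that failure explicit. The sketch below assumes `bash` and the coreutils `timeout` command are available; `port_open` and `wait_for_port` are hypothetical helper names, and the ports 8081 and 8099 are simply the ones used in the commands above.

```shell
# Hypothetical helpers for illustration; not part of Beam.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are installed.

# Succeeds if a TCP connection to host:port can be opened within 2 seconds.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Polls host:port about once per second, up to max_tries attempts (default 30).
wait_for_port() {
  wfp_host=$1
  wfp_port=$2
  wfp_tries=${3:-30}
  wfp_i=0
  while [ "$wfp_i" -lt "$wfp_tries" ]; do
    if port_open "$wfp_host" "$wfp_port"; then
      return 0
    fi
    wfp_i=$((wfp_i + 1))
    sleep 1
  done
  echo "nothing is listening on $wfp_host:$wfp_port" >&2
  return 1
}
```

For example, `wait_for_port localhost 8081 && wait_for_port localhost 8099` could be run after steps 1 and 2, with the `./gradlew` command in step 3 started only if both checks succeed.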