The PostRelease Nightly Snapshot job is flaky

apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

https://beam.apache.org/

Apache License 2.0


The PostRelease Nightly Snapshot job is flaky #30505

Open github-actions[bot] opened 4 months ago

github-actions[bot] commented 4 months ago

The PostRelease Nightly Snapshot is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_PostRelease_NightlySnapshot.yml?query=is%3Afailure+branch%3Amaster to see the logs.

shunping commented 4 months ago

Related to #30447

Abacn commented 4 months ago

Still failing:

Container image gcr.io/cloud-dataflow/v1beta3/beam_java8_sdk:beam-master-20240306 not downloaded yet.

It is strange that the container resolves to "beam_java8_sdk:beam-master-20240306". It picks the label for the legacy runner but actually tries to pull the runner v2 image. This is likely because Dataflow switched to runner v2 by default in Beam 2.55.0+.

https://github.com/apache/beam/blob/ef919e2603fcd6bffde2a15961d1f186448520a9/runners/google-cloud-dataflow-java/build.gradle#L54-L55

Opened #30634

liferoad commented 3 months ago

https://github.com/apache/beam/actions/runs/8619063045

java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DataflowRunner_team/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
  "status" : "NOT_FOUND"
}

liferoad commented 3 months ago

Looks much better. Closing this now.

Abacn commented 1 month ago

Currently there is flakiness because downloads of artifacts from the Maven snapshot repository are not retried. This is a Maven tooling limitation, but we could probably build first (with retry) so that the artifacts get cached in the local Maven repository.
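To make the "build first, with retry" idea concrete, here is a minimal sketch of a retry wrapper with exponential backoff around an external command. It is not what the workflow currently does; the function name `run_with_retry`, the attempt count, and the delays are assumptions chosen for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    base_delay: float = 5.0,
    runner: Callable[[Sequence[str]], int] = lambda c: subprocess.run(c).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits with 0; return the number of attempts used.

    Hypothetical helper: `runner` and `sleep` are injectable so the retry
    logic can be exercised without invoking a real build.
    """
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command failed after {attempts} attempts: {' '.join(cmd)}")
```

A warm-up step could then call something like `run_with_retry(["mvn", "-B", "dependency:go-offline"])` before the real test run, so that snapshot artifacts already sit in the local repository. Whether `dependency:go-offline` resolves everything this job needs is an assumption that would have to be checked against the actual build.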

liferoad commented 1 month ago

@shunping please check this when you have time.

damondouglas commented 1 month ago

This is related to the Maven snapshot issue. I wonder if we could use Artifact Registry's ability to store Java packages (https://cloud.google.com/artifact-registry/docs/java/store-java) instead of relying on Maven Central.

liferoad commented 1 month ago


[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project word-count-beam: An exception occured while executing the Java class. java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
[ERROR] POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DirectRunner_team/insertAll?prettyPrint=false
[ERROR] {
[ERROR]   "code" : 404,
[ERROR]   "errors" : [ {
[ERROR]     "domain" : "global",
[ERROR]     "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]     "reason" : "notFound"
[ERROR]   } ],
[ERROR]   "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]   "status" : "NOT_FOUND"
[ERROR] }
[ERROR] -> [Help 1]
[ERROR]

liferoad commented 1 month ago

Can we just add a retry to this task?

chamikaramj commented 3 weeks ago

Looking at some of the recent failures, it seems like the Java command was just crashing?

https://github.com/apache/beam/actions/runs/9537373049/job/26285395593 https://ge.apache.org/s/pmba6vnub3yz4

"Process 'command '/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/8.0.412-8/x64/bin/java'' finished with non-zero exit value 1"

chamikaramj commented 3 weeks ago

I also see the 404 error from BQ mentioned above in other failed runs, so it seems there are at least two failure modes.
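One way to tell these failure modes apart across many runs is to match the log text against the messages quoted in this thread. The sketch below is only an illustration: the mode names and the `classify_failure` helper are made up here, and the patterns cover just the strings that appear above (including the earlier container image message).

```python
import re
from typing import Dict, List

# Patterns are taken from log excerpts quoted in this thread.
# The keys are illustrative labels, not names used by Beam or GitHub Actions.
FAILURE_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "bigquery_table_not_found": re.compile(r"GoogleJsonResponseException: 404 Not Found"),
    "java_process_crash": re.compile(r"finished with non-zero exit value \d+"),
    "container_image_missing": re.compile(r"Container image \S+ not downloaded yet"),
}


def classify_failure(log_text: str) -> List[str]:
    """Return every known failure mode whose pattern occurs in log_text."""
    return [name for name, pattern in FAILURE_PATTERNS.items() if pattern.search(log_text)]
```

A log can match more than one mode. For example, a BigQuery 404 raised inside the Java example may also make the enclosing Java process exit with a non-zero code, so the labels are not mutually exclusive.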

chamikaramj commented 3 weeks ago

I wonder if the Java failure was due to an OOM. Can we increase the memory available to the VMs running these tests?

damccorm commented 2 weeks ago

Trying this with #31749