GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Google API Client Library version 1.23.0 causes runtime problems with Dataflow Java SDK #607

Open moandcompany opened 7 years ago

moandcompany commented 7 years ago

The new Google API Client Library, version 1.23.0, appears to cause problems with the Dataflow Java SDK when submitting and/or running jobs.

This appears to affect Dataflow Java SDKs in both major version families (e.g. 1.9.1, 2.0.0, and 2.1.0).

In some cases, these problems manifest as HTTP 404 errors when attempting to upload staging files:

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: Error executing batch GCS request :userprofile:run
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
(...)

Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:611)
at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:358)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:217)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:86)
(...)

Caused by: com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:604)
at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:602)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)

Workaround: Pinning the Google API Client Library dependencies to version 1.22.0 appears to avoid this issue.

Gradle Example:

compile('com.google.api-client:google-api-client:1.22.0') {
    force = true
}

Maven Example:

<dependency>
  <groupId>com.google.api-client</groupId>
  <artifactId>google-api-client</artifactId>
  <version>[1.22.0]</version>
</dependency>
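For newer Gradle versions, where the per-dependency `force = true` flag may be unavailable or deprecated, a resolution-strategy rule is a rough equivalent. This is a sketch using the same coordinates as above, not a verified fix:

```groovy
// build.gradle — force every resolution of google-api-client to 1.22.0,
// including transitive pulls from other Google libraries
configurations.all {
    resolutionStrategy {
        force 'com.google.api-client:google-api-client:1.22.0'
    }
}
```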
polleyg commented 7 years ago

We've had the same problem. For us, though, it was with the BigQuery API that we were bringing into our project. Removing it fixed it (Beam already depends on it anyway).

pheromonez commented 7 years ago

We're also experiencing issues during file staging. Before the attempt to upload files is made, we receive this error: WARNING: Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes, HTTP framework says request can be retried, (caller responsible for retrying): https://www.googleapis.com/storage/v1/b?predefinedAcl=projectPrivate&predefinedDefaultObjectAcl=projectPrivate&project=<project name omitted>

Accessing the specified HTTP resource returns JSON data containing an error with the message Anonymous users does not have storage.buckets.list access to project <project number omitted>.

afcastano commented 7 years ago

We had the same issue, and we can confirm that, as @moandcompany suggests, this fixes it:

compile('com.google.api-client:google-api-client:1.22.0') {
    force = true
}

For the record, our stack trace is pretty similar. We are running a 2.2.0 snapshot version of Apache Beam:

java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:603)
        at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:342)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:217)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:86)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:125)
        at org.apache.beam.sdk.io.FileSystems.matchSingleFileSpec(FileSystems.java:190)
        at org.apache.beam.runners.dataflow.util.PackageUtil.alreadyStaged(PackageUtil.java:159)
        at org.apache.beam.runners.dataflow.util.PackageUtil.stagePackageSynchronously(PackageUtil.java:188)
        at org.apache.beam.runners.dataflow.util.PackageUtil.access$000(PackageUtil.java:69)
        at org.apache.beam.runners.dataflow.util.PackageUtil$2.call(PackageUtil.java:176)
        at org.apache.beam.runners.dataflow.util.PackageUtil$2.call(PackageUtil.java:173)
        at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:595)
        ... 16 more
zinuzoid commented 7 years ago

I got a similar problem. Here's the API response:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(249a6f2653c550b0): The workflow was automatically rejected by the service because it may trigger an identified bug in the SDK.\nBug details: com.google.api-client:google-api-client library version 1.23.0 is not supported..\nContact dataflow-feedback@google.com for further help. Please use this identifier in your communication: 67379331.",
    "reason" : "badRequest"
  } ],
  "message" : "(249a6f2653c550b0): The workflow was automatically rejected by the service because it may trigger an identified bug in the SDK.\nBug details: com.google.api-client:google-api-client library version 1.23.0 is not supported..\nContact dataflow-feedback@google.com for further help. Please use this identifier in your communication: 67379331.",
  "status" : "INVALID_ARGUMENT"
}
lukecwik commented 7 years ago

Google added service-side support to reject job creation for jobs affected by this issue, to prevent users from starting malformed jobs.

frew commented 7 years ago

The root cause of the 404s is outlined at https://github.com/google/google-api-java-client/issues/1073. Hilariously, you can't get to the error rejecting the job for bad dependencies until you've cleared up the staging problem (in our case by upgrading to com.google.apis:google-api-services-storage:v1-rev115-1.23.0). Is there another problem that's causing the job rejection? We're being forced onto 1.23.0 by a bug in another Google API, so this puts us between a rock and a hard place because lol @ Java versioning on Maven.

Jdban commented 6 years ago

+1 happening to us too. Is there any suggested remedy?

moandcompany commented 6 years ago

The Cloud Dataflow team has added a page on Dataflow SDK and Worker Dependencies that identifies the google-api-client 1.22.0 version requirement (Java)

Jdban commented 6 years ago

> The Cloud Dataflow team has added a page on Dataflow SDK and Worker Dependencies that identifies the google-api-client 1.22.0 version requirement (Java)

That is a useful link, but not really a solution for those of us, like @frew, who need to use google-api-client 1.23.0 due to a bug in another library.

sgri commented 6 years ago

I also have this issue

ghost commented 6 years ago

Any updates? I'm running into this issue.

alan-ma-umg commented 6 years ago

Same here: Apache Beam 2.3.0 with the DataflowRunner hits the same 404 error. A permanent fix would be ideal.

Thanks.

dsquier commented 6 years ago

We encountered this as well. We're on Scio 0.5.5-beta1, and attempting to force the version to 1.22.0 using dependencyOverrides never worked. However, explicitly adding this library with force() did work, i.e.:

"com.google.api-client" % "google-api-client" % "1.22.0" force()
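For reference, both approaches in a full build.sbt might look like the sketch below (the dependencyOverrides form is the one that reportedly did not work for this reporter):

```scala
// build.sbt

// reportedly did NOT work here: overriding the transitive version
// dependencyOverrides += "com.google.api-client" % "google-api-client" % "1.22.0"

// reportedly DID work: declare the library directly and force it
libraryDependencies += "com.google.api-client" % "google-api-client" % "1.22.0" force()
```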
gfengster commented 6 years ago

I have the same problem. Google forces moving off storage@v1. Adding

<dependency>
  <groupId>com.google.apis</groupId>
  <artifactId>google-api-services-storage</artifactId>
  <version>v1-rev115-1.23.0</version>
</dependency>

changes the runtime error to Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NoClassDefFoundError: com/google/api/gax/rpc/HeaderProvider. It looks like the libraries conflict across Google's own infrastructure libraries. Horrible.

andrewcassidy commented 6 years ago

@dsquier omg thank you. I was battling dependencyOverrides for a while and didn't think about force.

pabloazurduy commented 6 years ago

I was redirected here from Google because I was using the bigquery-client library and the same error appeared. Has anybody found a workaround for this issue? I've tried (without success):

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-bigquery</artifactId>
      <version>0.21.0-beta</version>
    </dependency>
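One option worth trying in this situation is excluding the transitive google-api-client that the BigQuery artifact pulls in and pinning 1.22.0 directly. This is a sketch only, not verified against this exact artifact version:

```xml
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-bigquery</artifactId>
  <version>0.21.0-beta</version>
  <exclusions>
    <!-- drop the transitive 1.23.0 client -->
    <exclusion>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- pin the known-good client version directly -->
<dependency>
  <groupId>com.google.api-client</groupId>
  <artifactId>google-api-client</artifactId>
  <version>1.22.0</version>
</dependency>
```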
pievis commented 6 years ago

After analyzing my dependencies and checking the error, I was able to fix this by forcing the version of google-api-services-dataflow to v1b3-rev221-1.22.0 (and, of course, setting google-api-client to version 1.22.0).

Setting only google-api-client to the old version wasn't enough for me, since the following error was thrown when trying to compile my Dataflow template:

java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUt

vinnybod commented 6 years ago

For anyone else still seeing issues like this, check out the version numbers here and make sure you aren't importing a conflicting dependency.
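When auditing for conflicts like this in Maven, one way to keep every transitive pull at the known-good version is a dependencyManagement pin. A sketch, assuming 1.22.0 is the version your SDK requires:

```xml
<!-- pom.xml — any module (direct or transitive) that asks for
     google-api-client will resolve to 1.22.0 -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client</artifactId>
      <version>1.22.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running mvn dependency:tree afterwards is a quick way to confirm which version actually ends up on the classpath.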

labianchin commented 6 years ago

Now Beam 2.5.0 depends on google-api-client:1.23.0, see https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies. Is this still an issue?