GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Error 409 from BigQuery when using in template #550

Open sduck opened 7 years ago

sduck commented 7 years ago

I'm building a simple batch job, implemented as a template, that is supposed to read data from BigQuery. Everything works fine on the first run, but every subsequent execution of the template results in an error from the BigQuery service: "Request failed with code 409, will NOT retry: https://www.googleapis.com/bigquery/v2/projects/boozt-ga/jobs"

I can see that all executions end up giving the BigQuery extract job the exact same job ID, and that seems to be the reason BigQuery fails.
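
A minimal sketch of the behaviour described above, written as a toy model rather than actual SDK code. `FakeJobService`, `Template`, `fixedAtCreation` and `generatedPerRun` are hypothetical names invented for illustration; nothing here calls the Dataflow or BigQuery APIs. It assumes only what the report states: a job submitted with an already-used job ID is rejected with 409. If the ID is chosen once when the template is built, the second run is rejected; if it is chosen on each run, both runs succeed.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;

/** Toy model of the failure; none of these types belong to the Dataflow or BigQuery APIs. */
class JobIdSketch {

    /** Hypothetical stand-in for a job service that rejects an already-used job ID. */
    static final class FakeJobService {
        private final Set<String> seenJobIds = new HashSet<>();

        /** Returns 200 for a new job ID and 409 (Conflict) for a reused one. */
        int insertJob(String jobId) {
            return seenJobIds.add(jobId) ? 200 : 409;
        }
    }

    /** Hypothetical "template": holds a recipe for the job ID and applies it on each run. */
    static final class Template {
        private final Supplier<String> jobId;

        Template(Supplier<String> jobId) {
            this.jobId = jobId;
        }

        int run(FakeJobService service) {
            return service.insertJob(jobId.get());
        }
    }

    /** ID chosen once, when the template is built: every run submits the same value. */
    static Template fixedAtCreation() {
        String id = "extract-" + UUID.randomUUID();
        return new Template(() -> id);
    }

    /** ID chosen on each run: every run submits a fresh value. */
    static Template generatedPerRun() {
        return new Template(() -> "extract-" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        FakeJobService service = new FakeJobService();

        Template fixed = fixedAtCreation();
        System.out.println("fixed, run 1: " + fixed.run(service));    // 200
        System.out.println("fixed, run 2: " + fixed.run(service));    // 409

        Template perRun = generatedPerRun();
        System.out.println("per-run, run 1: " + perRun.run(service)); // 200
        System.out.println("per-run, run 2: " + perRun.run(service)); // 200
    }
}
```

In this model, retrying the rejected request would not help, since the same ID would be submitted again; that is consistent with the "will NOT retry" wording in the error message.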

davorbonaci commented 7 years ago

@sduck, thanks for your report. Indeed, this is an issue we are looking at.

@sammcveety, can you perhaps comment more?

sammcveety commented 7 years ago

@sduck this is a limitation in the current SDK, documented at https://cloud.google.com/dataflow/docs/templates/creating-templates#pipeline-io-and-runtime-parameters. We are working to remove this restriction in future releases.

sduck commented 7 years ago

Thanks for the updates - looking forward to a solution on this restriction.

ptf commented 7 years ago

Hi @sammcveety how far off would a <2.0 release be?

sammcveety commented 7 years ago

If you mean >=2.0, there is already a 2.0beta2 out. https://github.com/apache/beam/pull/2123 addresses BQ.Read.

ptf commented 7 years ago

@sammcveety If we are talking about 2.0, when would a GA release be expected (for support, business approval, etc.)? And will the bugfix be backported to the Dataflow 1.9.x SDK?

sammcveety commented 7 years ago

I believe a tentative date of Q2 was announced at Next17. There are no plans for a backport to 1.9.

kmaillet commented 7 years ago

@sammcveety I'm having this exact problem with beam-sdks-java-io-google-cloud-platform 0.6.0. Can't wait for the solution ;)

sammcveety commented 7 years ago

https://github.com/apache/beam/pull/2123 is in progress.

HayoVanLoon commented 7 years ago

Just wanted to add that the cloud console lists such a repeated batch job as successful even though it produces no output (for the bq load step, in any case). I should have read the documentation more carefully (mea culpa), but this had me confused for some time, until I finally landed here. Hope we'll see 2.0 soon; I suppose I'll work around it with TextIO and a separate load in the meantime.

domparry commented 7 years ago

It would be a huge pity not to backport to 1.9.x, since the templating feature exists and is pretty much paralyzed for BQ, where it's most useful. What's the logic behind not doing so? Is it more work than everyone porting to >2.0?

sammcveety commented 7 years ago

There are no plans to make further enhancements to 1.9. Upgrading to 2.0 should be relatively easy; please let us know if you encounter issues.

jaebinyo commented 7 years ago

It seems this problem still exists in v2.0.0. The documentation for SDK 2.x says:

"For BigQuery batch pipelines, templates can only be executed once, as the BigQuery job ID is set at template creation time. This restriction will be removed in a future release."

What's the point of supporting templates if they can only be executed once? It would be better to say that templating isn't supported. I wasted several hours because I missed that fine print.

sammy88888888 commented 6 years ago

Is there a time estimate for implementing this yet?

ganaz commented 6 years ago

It's fixed in Beam SDK 2.3.0: https://cloud.google.com/dataflow/docs/templates/creating-templates#pipeline-io-and-runtime-parameters

radhikakale commented 3 years ago

I am having the same issue with Beam version 2.25.0.