GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Extending BigqueryTableInserter to take maximum batch size #488

Closed KotaiVictor closed 7 years ago

KotaiVictor commented 7 years ago

We ran into an issue in production recently where one of our workers was stalling on a particular hotkey.

Looking further into the issue, we figured out that, given our row sizes, we were inserting somewhere between 5 and 20 elements per call.

This, coupled with our windowing size of around 5000 and the fixed pool of 100 threads that the BigQueryTableInserter uses for inserts, slowed the worker down.

We thought it would be useful to be able to configure this.
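To illustrate the idea, here is a minimal sketch of capping the number of rows sent per insert call. It is not the code in this PR: `BatchSplitter`, `split`, and `maxRowsPerBatch` are hypothetical names and are not part of the SDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the Dataflow SDK. */
public class BatchSplitter {

  /**
   * Splits rows into consecutive batches holding at most maxRowsPerBatch
   * elements each. The last batch may be smaller.
   */
  public static <T> List<List<T>> split(List<T> rows, int maxRowsPerBatch) {
    if (maxRowsPerBatch <= 0) {
      throw new IllegalArgumentException("maxRowsPerBatch must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < rows.size(); start += maxRowsPerBatch) {
      int end = Math.min(start + maxRowsPerBatch, rows.size());
      // Copy the slice so each batch is independent of the input list.
      batches.add(new ArrayList<>(rows.subList(start, end)));
    }
    return batches;
  }
}
```

With a configurable limit, each batch would map to one insert call, so a larger limit means fewer calls for the same number of rows (subject to whatever request-size limits the service enforces).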

Ideally, we'd be able to set the thread count for the Executor too. Given that it's static, you can do at most 100 inserts in parallel from a worker.
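As a rough sketch of what a configurable thread count could look like, assume the shared static executor were replaced by one owned by each inserter instance. `ConfigurableInserter`, `threadCount`, and `runAll` are hypothetical names; this is not the SDK's BigQueryTableInserter.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch for illustration; not the SDK's BigQueryTableInserter. */
public class ConfigurableInserter {
  private final ExecutorService executor;

  public ConfigurableInserter(int threadCount) {
    // Pool size chosen by the caller rather than fixed in a static field.
    this.executor = Executors.newFixedThreadPool(threadCount);
  }

  /** Runs every insert task on the pool and returns how many completed. */
  public int runAll(List<Callable<Void>> insertTasks) {
    try {
      // invokeAll blocks until all tasks have finished.
      List<Future<Void>> futures = executor.invokeAll(insertTasks);
      int completed = 0;
      for (Future<Void> future : futures) {
        future.get(); // throws ExecutionException if the task failed
        completed++;
      }
      return completed;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while inserting", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("Insert task failed", e.getCause());
    }
  }

  public void shutdown() {
    executor.shutdown();
  }
}
```

The upper bound on parallel inserts per inserter then becomes `threadCount` instead of a hard-coded value.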

googlebot commented 7 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


KotaiVictor commented 7 years ago

I signed it!

googlebot commented 7 years ago

CLAs look good, thanks!

dhalperi commented 7 years ago

Looks fine to me, but don't you want a way to actually plumb the option through from BigQueryIO.Write? Otherwise, this is unused code.

Can you please submit any upgrades to Apache Beam, which will be the basis of Dataflow 2.0? That way they will not be lost in the future.

dhalperi commented 7 years ago

Hi @KotaiVictor,

Due to inactivity, I'm going to close this PR for now. However, if you are ready to discuss it again, please let me know.

Thanks!