GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Cloud Dataflow: poor performance while loading .csv files to BigQuery #577

Closed kpr6 closed 7 years ago

kpr6 commented 7 years ago

I'm using Cloud Dataflow to transfer a specific set of columns from a .csv file stored in a Cloud Storage bucket, which has half a million rows and 257 columns, to a BigQuery partitioned table. I've created a template, and I'm passing the file and table names as runtime parameters. It takes 10 minutes on average to complete this task, which is a long time. Does it usually take this long, or is it something I'll have to fix?
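
For context, the template options look roughly like the sketch below, assuming the 1.x SDK's ValueProvider-based template parameters (the interface and parameter names are hypothetical):

```java
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.ValueProvider;

// Hypothetical options interface for the template: the file and table
// names are supplied at job-submission time, not at template-creation time.
public interface CsvToBigQueryOptions extends PipelineOptions {
  @Description("GCS path of the input .csv file")
  ValueProvider<String> getInputFile();
  void setInputFile(ValueProvider<String> value);

  @Description("Destination BigQuery table spec, e.g. project:dataset.table")
  ValueProvider<String> getOutputTable();
  void setOutputTable(ValueProvider<String> value);
}
```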

The process I'm following is to pass the header list and the list of columns I have to move as side inputs to a ParDo transform, which converts each String line into a TableRow so it can be written to the BigQuery table.
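
A minimal sketch of that ParDo, assuming the 1.x SDK style (the class and variable names are hypothetical, and the naive comma split stands in for a real CSV parser):

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.PCollectionView;
import java.util.List;

// Converts one CSV line into a TableRow, keeping only the requested columns.
class CsvToTableRowFn extends DoFn<String, TableRow> {
  private final PCollectionView<List<String>> headersView;  // side input: all 257 header names
  private final PCollectionView<List<String>> columnsView;  // side input: columns to keep

  CsvToTableRowFn(PCollectionView<List<String>> headersView,
                  PCollectionView<List<String>> columnsView) {
    this.headersView = headersView;
    this.columnsView = columnsView;
  }

  @Override
  public void processElement(ProcessContext c) {
    List<String> headers = c.sideInput(headersView);
    List<String> columns = c.sideInput(columnsView);
    String[] fields = c.element().split(",", -1);  // naive split; quoted fields need a CSV library
    TableRow row = new TableRow();
    for (String column : columns) {
      int i = headers.indexOf(column);
      if (i >= 0 && i < fields.length) {
        row.set(column, fields[i]);
      }
    }
    c.output(row);
  }
}
```

The side inputs are attached when the transform is applied, e.g. `ParDo.withSideInputs(headersView, columnsView).of(new CsvToTableRowFn(headersView, columnsView))`.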

lukecwik commented 7 years ago

Do you have some example job ids?

kpr6 commented 7 years ago

Yes, I do, but will you be able to access the jobs with them?

lukecwik commented 7 years ago

As an engineer who works on Google Cloud Dataflow, I'll have access to limited information about your pipeline if you supply some job IDs. Alternatively, you can reach Google Cloud Dataflow support directly, or e-mail dataflow-feedback@google.com referencing this issue and providing some example job IDs exhibiting the problem.

kpr6 commented 7 years ago

Ohh great!

  1. 2017-06-02_01_55_00-8333187844322894904
  2. 2017-06-02_02_38_12-12774089450158834578

Above are the job IDs for you. If I may add to the problem, there is also an inconsistency: a template job succeeds and loads data into the corresponding partition of the table the first time after the template is created, but later jobs run from the same template succeed without loading any data into the partition. So the first job ID above was successful in loading the data, but the second was not. Can you please help here too? It would mean a lot. Thank you!

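For reference, a single day's partition of a day-partitioned table is usually addressed by appending a $YYYYMMDD decorator to the table spec passed to the sink; a minimal sketch, assuming the 1.x BigQueryIO API (the project, dataset, and table names are hypothetical):

```java
import com.google.cloud.dataflow.sdk.io.BigQueryIO;

// Write the converted rows into the 2017-06-02 partition of the destination table.
// CREATE_NEVER is used because a partition decorator cannot create a new table.
rows.apply(BigQueryIO.Write
    .to("myproject:mydataset.mytable$20170602")
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```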
lukecwik commented 7 years ago

Based upon the data, it looks like the majority of the time is being spent on startup/shutdown (it usually takes a few minutes on each side to get a VM running and to tear one down), which is something you can't control. As GCE gets faster, your pipeline will get faster in this regard.

It also looks like you could use more workers from the beginning, since the work looks parallelizable. You have one ParDo which took ~4 minutes, and the initial data load took ~1 minute. Increasing the number of workers you start the job with will decrease these times approximately linearly.
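
If it helps, the starting worker count can be set when the job is launched; a minimal sketch, assuming the 1.x SDK (the count of 5 is an arbitrary example):

```java
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// Equivalent to passing --numWorkers=5 on the command line: the job starts
// with 5 workers instead of scaling up from a smaller pool.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
options.setNumWorkers(5);
```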

You have little control over the BigQuery import, which is also variable.

It looks like you could save 2-3 minutes by using more workers, but not much more than that, since several factors are out of your control.

lukecwik commented 7 years ago

Please re-open if you would like to follow up further.