Closed kpr6 closed 7 years ago
Do you have some example job ids?
Yes I do, but will you be able to access the job with it?
As an engineer who works on Google Cloud Dataflow, I'll have access to limited information about your pipeline if you supply some job ids. Alternatively you can reach Google Cloud Dataflow support directly or e-mail dataflow-feedback@google.com referencing this and providing some example job ids exhibiting the issue.
Ohh great!
Based on the data, it looks like we are spending the majority of the time on startup/shutdown (it usually takes a few minutes on each side to start a VM and tear one down), which is something you can't control. As GCE gets faster, your pipeline will get faster in this regard.
It also looks like you could use more workers from the beginning, since the work looks parallelizable. You have one ParDo which took ~4 minutes, and the initial data load took ~1 minute. Increasing the number of workers you start the job with will decrease these times approximately linearly.
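As a rough illustration of why more workers shrink only the parallel portion of the job: the parallelizable work divides across workers while startup/shutdown overhead stays fixed. The worker counts and the 5-minute overhead figure below are illustrative assumptions, not measurements from your job.

```python
# Rough scaling model (assumption: the ~4 min ParDo and ~1 min load
# parallelize almost perfectly; startup/shutdown overhead is fixed).
def estimated_minutes(workers, pardo_min=4.0, load_min=1.0, overhead_min=5.0):
    """Estimate total job time: parallel work shrinks ~linearly, overhead doesn't."""
    return (pardo_min + load_min) / workers + overhead_min

print(estimated_minutes(1))   # 10.0 -> one worker, roughly the reported runtime
print(estimated_minutes(5))   # 6.0  -> five workers, only the parallel part shrinks
```

This also shows why the savings cap out at a few minutes: past a handful of workers, the fixed overhead dominates.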
You have little control over the BigQuery import, which is also variable.
It looks like you could save 2-3 minutes by using more workers, but not much more than that, since several factors are out of your control.
Please re-open if you would like to follow up further.
I'm using Cloud Dataflow to transfer a specific number of columns from a .csv file stored in a bucket (half a million rows and 257 columns) to a BigQuery partitioned table. I've created a template, and I'm passing the file and table name as runtime parameters. It takes 10 minutes on average to complete this task, which is a long time. Does it usually take this long, or is it something I'll have to fix?
The process I'm following is passing the header list and the column list as side inputs to a ParDo transform, which converts each String to TableRow format so it can be written to the BigQuery table.
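For reference, here is a minimal sketch of the kind of per-line conversion such a ParDo might do. The function name, sample headers, and columns are made up for illustration; inside a Beam DoFn, the header and column lists would arrive as the side inputs described above, and the returned dict would map onto a BigQuery row.

```python
import csv
import io

# Hypothetical per-element logic for the ParDo described above: parse one
# CSV line, then keep only the wanted columns (headers and wanted_columns
# stand in for the side inputs mentioned in the issue).
def line_to_row(line, headers, wanted_columns):
    """Parse one CSV line and return a dict of just the wanted columns."""
    values = next(csv.reader(io.StringIO(line)))
    row = dict(zip(headers, values))
    return {col: row[col] for col in wanted_columns}

headers = ["id", "name", "city"]   # assumed side input
wanted = ["id", "city"]            # assumed side input
print(line_to_row("1,Alice,Paris", headers, wanted))  # {'id': '1', 'city': 'Paris'}
```

Using the `csv` module rather than `str.split(",")` matters here, since quoted fields in a 257-column CSV can themselves contain commas.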