GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

DatastoreIO can't read large dataset from datastore #537

Closed email2liyang closed 7 years ago

email2liyang commented 7 years ago

We have a Google Datastore kind with 61,647,456 entities, around 21.4 GB in total. When we try to read the whole kind with the code below, the job fails.

final Pipeline pipeline = Pipeline.create(options);
pipeline.apply("read from datastore",
        DatastoreIO.v1().read().withProjectId(options.getProject())
            .withNamespace(CITE_NAMESPACE)
            .withQuery(queryBuilder.build()))
    .apply("xx", ParDo.of(new DoFn<Entity, String>() {  // output type is a placeholder
      @Override
      public void processElement(ProcessContext c) {
        // real processing elided
      }
    }));

pipeline.run();

We have tried 3 times, and each run failed with the error below.

Is there anything we can do to tune the steps?

(c4effe748a3cb249): Workflow failed. Causes: (4c8d3a94c2fb7abe): S09:read from datastore/GroupByKey/Read+read from datastore/GroupByKey/GroupByWindow+read from datastore/Values/Values+read from datastore/Flatten.FlattenIterables/FlattenIterables+read from datastore/ParDo(Read)+get family id without applicant+write family id out/Write/DataflowPipelineRunner.BatchWrite/Window.Into()+write family id out/Write/DataflowPipelineRunner.BatchWrite/WriteBundles+write family id out/Write/DataflowPipelineRunner.BatchWrite/View.AsIterable/DataflowPipelineRunner.BatchViewAsIterable/ParDo(ToIsmRecordForGlobalWindow) failed.
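
For anyone hitting the same wall, one common mitigation (not confirmed in this thread, and the worker counts, machine type, and disk size below are hypothetical) is to give the job more and larger workers through DataflowPipelineOptions before creating the pipeline. A minimal sketch, assuming the SDK 1.x options interfaces:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class TunedPipelineSetup {  // hypothetical helper class
  public static Pipeline create(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Hypothetical sizing for a ~60M-entity read; adjust to quota and budget.
    options.setNumWorkers(10);
    options.setMaxNumWorkers(50);
    options.setWorkerMachineType("n1-standard-4");
    options.setDiskSizeGb(100);

    return Pipeline.create(options);
  }
}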

email2liyang commented 7 years ago

I got an answer from Google support: we can export a Datastore backup to Google Cloud Storage, load the backup file into Google BigQuery, and then query it there and export the result as a CSV file.
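
As a rough sketch of the BigQuery half of that workaround (the Datastore backup itself still has to be produced via the Datastore Admin export first; the bucket, dataset, and table names below are made up), the google-cloud-bigquery Java client can load a backup file and extract the table to CSV:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class DatastoreBackupToCsv {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Hypothetical dataset and table names.
    TableId table = TableId.of("my_dataset", "cite_entities");

    // 1) Load the Datastore backup (the .backup_info file in GCS) into BigQuery.
    LoadJobConfiguration load = LoadJobConfiguration
        .newBuilder(table, "gs://my-bucket/backup/cite.backup_info")
        .setFormatOptions(FormatOptions.datastoreBackup())
        .build();
    bigquery.create(JobInfo.of(load)).waitFor();

    // 2) Export the loaded (or queried) table to CSV files in GCS.
    ExtractJobConfiguration extract = ExtractJobConfiguration
        .newBuilder(table, "gs://my-bucket/export/cite-*.csv")
        .setFormat("CSV")
        .build();
    bigquery.create(JobInfo.of(extract)).waitFor();
  }
}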