We have Google Datastore entities with a row count of 61,647,456; the total size of the entities is around 21.4 GB. We try to read the whole table with the code below:
final Pipeline pipeline = Pipeline.create(options);
pipeline.apply("read from datastore",
        DatastoreIO.v1().read()
            .withProjectId(options.getProject())
            .withNamespace(CITE_NAMESPACE)
            .withQuery(queryBuilder.build()))
    .apply("xx", ParDo.of(new DoFn<Entity, Void>() {
        @Override
        public void processElement(ProcessContext c) {
            // process each Entity read from Datastore
        }
    }));
pipeline.run();
We have tried 3 times:
5 workers on machine type n1-standard-1: failed after reading 6,531,183 rows
7 workers on machine type n1-standard-4: failed after reading 19,284,163 rows
30 workers on machine type n1-standard-2: failed after reading 5,744,184 rows
They all failed at the "read from datastore" stage when trying to GroupByKey. For the 3rd run, I think we provided more than enough compute resources just to read from Datastore, yet it still failed, and it read the fewest rows of the three runs. Is there anything we can do to tune the steps?
(c4effe748a3cb249): Workflow failed. Causes: (4c8d3a94c2fb7abe): S09:read from datastore/GroupByKey/Read+read from datastore/GroupByKey/GroupByWindow+read from datastore/Values/Values+read from datastore/Flatten.FlattenIterables/FlattenIterables+read from datastore/ParDo(Read)+get family id without applicant+write family id out/Write/DataflowPipelineRunner.BatchWrite/Window.Into()+write family id out/Write/DataflowPipelineRunner.BatchWrite/WriteBundles+write family id out/Write/DataflowPipelineRunner.BatchWrite/View.AsIterable/DataflowPipelineRunner.BatchViewAsIterable/ParDo(ToIsmRecordForGlobalWindow) failed.
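For reference, DatastoreIO.v1().read() does expose withNumQuerySplits(...), a hint for how many sub-queries the source query is broken into, which is the main tuning knob on the read step itself. A minimal sketch; the split count of 200 is purely illustrative, not a recommendation:

pipeline.apply("read from datastore",
    DatastoreIO.v1().read()
        .withProjectId(options.getProject())
        .withNamespace(CITE_NAMESPACE)
        .withQuery(queryBuilder.build())
        // Hint only; the service may adjust the requested split count.
        .withNumQuerySplits(200));

More splits mainly increase the parallelism available to the workers executing the read.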
I got an answer from Google support: we can export the Datastore backup files into Google Cloud Storage and import the backup file into Google BigQuery. Then we can run the query there and export the result as a CSV file.
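For anyone following the same route, below is a rough sketch of the BigQuery side using the google-cloud-bigquery Java client. The bucket, dataset, table, and export path are placeholders, and it assumes the Datastore export has already been written to Cloud Storage:

import com.google.cloud.bigquery.*;

public class LoadDatastoreExport {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Load the Datastore export (its .export_metadata file) into a table.
    TableId table = TableId.of("my_dataset", "cite_entities"); // placeholders
    LoadJobConfiguration load = LoadJobConfiguration
        .newBuilder(table,
            "gs://my-bucket/exports/all_namespaces/kind_Cite/"
                + "all_namespaces_kind_Cite.export_metadata") // placeholder path
        .setFormatOptions(FormatOptions.datastoreBackup())
        .build();
    bigquery.create(JobInfo.of(load)).waitFor();

    // Extract the table to sharded CSV files in Cloud Storage.
    ExtractJobConfiguration extract = ExtractJobConfiguration
        .newBuilder(table, "gs://my-bucket/out/cite-*.csv")
        .setFormat("CSV")
        .build();
    bigquery.create(JobInfo.of(extract)).waitFor();
  }
}

If a query is needed before the export, a query job writing to a destination table can sit between the two steps. Note that BigQuery can only extract flat schemas to CSV, so entities with nested or repeated properties may need to be flattened by such a query first.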