What happened?
Reported from https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759
When implementing a load test for BigtableIO, we encountered the following: load tests of up to 200 MB pass stably, but beyond roughly 5 million records not all of the data ends up in Bigtable, even though the pipeline logs indicate that all data was written.

The Dataflow write pipeline logs say that 10M records were written. However, the read job shows only 1.6M records read. Checking with the cbt utility (`cbt -instance <instance> count <table>`) confirmed that the BigtableIO write did not work correctly: despite the logs saying that all 10M records were written, the table contained exactly as many rows as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never reached the table.

The cause is:

- the cells are written without an explicit timestamp, so they default to the epoch (1970-01-01);
- the table created with createTable has a garbage-collection policy of 1 hour, so writing a large amount of data triggers garbage collection and the epoch-timestamped records get deleted.

We need to set `.setTimestampMicros(java.time.Instant.now().toEpochMilli() * 1000)` on each Mutation.SetCell.
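For illustration, a minimal sketch of building such a mutation with an explicit timestamp (the class name and the family/qualifier/value parameters are placeholders, not the load-test code):

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.time.Instant;

public class SetCellWithTimestamp {

  // Builds a SetCell mutation with an explicit timestamp. Without
  // setTimestampMicros, the cell timestamp defaults to the epoch
  // (1970-01-01), so an age-based garbage-collection policy on the column
  // family can delete the cell almost immediately after it is written.
  static Mutation setCell(String family, String qualifier, String value) {
    return Mutation.newBuilder()
        .setSetCell(
            Mutation.SetCell.newBuilder()
                .setFamilyName(family)
                .setColumnQualifier(ByteString.copyFromUtf8(qualifier))
                .setValue(ByteString.copyFromUtf8(value))
                // Bigtable timestamps are in microseconds; toEpochMilli() * 1000
                // converts milliseconds to microseconds.
                .setTimestampMicros(Instant.now().toEpochMilli() * 1000))
        .build();
  }
}
```

These are the Mutation values that BigtableIO.write() consumes as `KV<ByteString, Iterable<Mutation>>` elements, so the timestamp has to be set before the elements reach the write transform.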
- Dataflow write pipeline logs: job 2023-06-05_03_51_23-9051905355392445711
- Dataflow read pipeline logs: job 2023-06-05_03_58_18-7016807525741705033
- project: apache-beam-testing

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [X] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
Abacn commented (1 year ago):
=========== (obsolete) ===========

- Tested with Beam 2.47.0 and 2.48.0, table created with BigtableTableAdminClient.createTable: expected number of records (tested with 20M and 100M records) (jobId: 2023-06-06_14_34_14-776136986672260899, 2023-06-06_15_02_28-18162755370264063675)
- Tested with BigtableIOLT in DataflowTemplates, Beam 2.47.0: records missing (jobId: 2023-06-06_14_52_02-13679425791336453528)
- Tested with BigtableIOLT in DataflowTemplates, Beam 2.47.0, table created with BigtableTableAdminClient.createTable: expected number of records (jobId: 2023-06-06_15_32_45-12170662821065212708)

For the job with missing records, `cbt ... -instance <instance> count <table>` showed the record count decreasing partway through the write.
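For context, a table configured the way the cause description above suggests (a column family with a 1-hour max-age garbage-collection rule) could be created with BigtableTableAdminClient roughly as in this sketch; the project, instance, table, and family ids are placeholders, and the exact rule used by the test resource manager is an assumption:

```java
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;
import com.google.cloud.bigtable.admin.v2.models.GCRules;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class CreateTableWithMaxAgeGc {

  public static void main(String[] args) throws IOException {
    // Placeholder project/instance/table/family ids, for illustration only.
    try (BigtableTableAdminClient adminClient =
        BigtableTableAdminClient.create("my-project", "my-instance")) {
      // A column family with a 1-hour max-age rule: any cell whose timestamp is
      // more than an hour in the past (including cells that defaulted to the
      // 1970 epoch) becomes eligible for garbage collection.
      adminClient.createTable(
          CreateTableRequest.of("load-test-table")
              .addFamily("cf", GCRules.GCRULES.maxAge(1, TimeUnit.HOURS)));
    }
  }
}
```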
Abacn commented (1 year ago):

This is not related to Beam; the DataflowTemplates test utility's resource manager had a wrong setting. The cause was found there: https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759#discussion_r1220523303

Abacn commented (1 year ago):

Turns out this can also affect real usage when the timestamp field is not set, so reopening this and keeping it as P1.

Possible solutions:

- when the incoming timestamp is empty, default it to the current time (instead of the epoch)
- when the incoming timestamp is empty, raise a warning

CC: @mutianf @ahmedabu98 (this also affects xlang Bigtable)
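A rough sketch of the kind of validation/warning proposed above (a hypothetical helper, not the actual Beam change; the method name and the non-positive-timestamp check are assumptions):

```java
import com.google.bigtable.v2.Mutation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MutationTimestampValidation {

  private static final Logger LOG = LoggerFactory.getLogger(MutationTimestampValidation.class);

  // Hypothetical pre-write check along the lines of the proposal above: warn
  // when a SetCell mutation carries no usable timestamp. A value of 0 means the
  // cell is written at the 1970 epoch; treating negative values as "unset" here
  // is an assumption of this sketch, not Beam behavior.
  static void warnIfTimestampUnset(Mutation mutation) {
    if (mutation.hasSetCell()) {
      long timestampMicros = mutation.getSetCell().getTimestampMicros();
      if (timestampMicros <= 0) {
        LOG.warn(
            "SetCell mutation has timestampMicros={}; with an age-based garbage-collection "
                + "policy such cells may be deleted right after they are written. "
                + "Consider setting an explicit timestamp.",
            timestampMicros);
      }
    }
  }
}
```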
Abacn commented (1 year ago):

Per https://github.com/apache/beam/pull/28624#discussion_r1338601869, at the least we should add some validation in the write transform.

kennknowles commented (8 months ago):

Would the follow-up be P2 or still P1?

Abacn commented (8 months ago):

This is due to a user bug (an incorrect/epoch timestamp attached to the cell). The issue is kept open because there is a follow-up (adding a warning) that can still be done, so it is kept at P2, and the issue title has been updated.