[Bug]: Possible data loss in BigtableIO r/w if timestamp not set (default to epoch)

What happened?

Reported from https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759

When implementing a load test for BigtableIO, we encountered the following:

- Load tests up to 200 MB pass reliably.
- Beyond 5 million records, not all data reaches Bigtable, even though the pipeline logs indicate that all data was written.

Dataflow write pipeline logs say that 10M records were written. However, the read job shows only 1.6M records read.

Running `cbt -instance <instance> count <table>` with the cbt utility confirmed that the BigtableIO write did not work correctly. Although the logs say that all 10M records were written, the table contained exactly as many rows as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never reached the table.

Dataflow write pipeline logs - 2023-06-05_03_51_23-9051905355392445711
Dataflow read pipeline logs - 2023-06-05_03_58_18-7016807525741705033

project: apache-beam-testing

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

[ ] Component: Python SDK
[X] Component: Java SDK
[ ] Component: Go SDK
[ ] Component: Typescript SDK
[X] Component: IO connector
[ ] Component: Beam examples
[ ] Component: Beam playground
[ ] Component: Beam katas
[ ] Component: Website
[ ] Component: Spark Runner
[ ] Component: Flink Runner
[ ] Component: Samza Runner
[ ] Component: Twister2 Runner
[ ] Component: Hazelcast Jet Runner
[ ] Component: Google Cloud Dataflow Runner

Abacn commented 1 year ago

This is not related to Beam; the resource manager in the DataflowTemplates test utility has a wrong setting.

The cause was found there: https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759#discussion_r1220523303

The cause is:
- The cell does not have a timestamp set, so it defaults to the epoch (1970-01-01).
- The table created by createTable has a garbage collection policy of 1h, so writing a large amount of data triggers GC and some records get deleted.
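To make the interaction between the two points above concrete, here is a minimal sketch using only the JDK. It models a max-age GC rule as "a cell is eligible once its timestamp is older than now minus max age"; this is a simplification for illustration, not Bigtable's actual GC implementation. The class name, method name, and the sample instant are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

class GcAgeSketch {
    // Simplified model (an assumption): a cell is eligible for a max-age GC
    // rule when its timestamp is older than (now - maxAge).
    static boolean isOlderThanMaxAge(long cellTimestampMicros, Instant now, Duration maxAge) {
        long nowMicros = now.toEpochMilli() * 1000L;
        long maxAgeMicros = maxAge.toMillis() * 1000L;
        return cellTimestampMicros < nowMicros - maxAgeMicros;
    }

    public static void main(String[] args) {
        // Arbitrary sample instant, chosen only for illustration.
        Instant now = Instant.parse("2023-06-05T10:51:23Z");
        Duration maxAge = Duration.ofHours(1); // the 1h policy described above

        long unsetTimestamp = 0L; // timestamp never set, i.e. the epoch
        long currentTimestamp = now.toEpochMilli() * 1000L;

        System.out.println(isOlderThanMaxAge(unsetTimestamp, now, maxAge));   // true
        System.out.println(isOlderThanMaxAge(currentTimestamp, now, maxAge)); // false
    }
}
```

Under this model, a cell written with timestamp 0 is already decades older than the 1h limit the moment it lands, so it can be collected while the write job is still running.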

We need to use `.setTimestampMicros(java.time.Instant.now().toEpochMilli()*1000)` for Mutation.SetCell
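A small JDK-only sketch of the conversion used in that call. The helper name `toTimestampMicros` is hypothetical; the result is the value that would be passed to `setTimestampMicros` on the `Mutation.SetCell` builder. Going through `toEpochMilli()` keeps the value a multiple of 1000, which, as far as I know, matches the default millisecond granularity of Bigtable tables.

```java
import java.time.Instant;

class CellTimestampSketch {
    // Hypothetical helper: converts an Instant to microseconds since the epoch,
    // truncated to millisecond precision (the result is a multiple of 1000).
    static long toTimestampMicros(Instant instant) {
        return instant.toEpochMilli() * 1000L;
    }

    public static void main(String[] args) {
        long timestampMicros = toTimestampMicros(Instant.now());
        // In the write path this value would be supplied as:
        //   Mutation.SetCell.newBuilder()...setTimestampMicros(timestampMicros)
        System.out.println(timestampMicros);
    }
}
```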

=========== (obsolete)

Tested with Beam 2.47.0, 2.48.0, table created with BigtableTableAdminClient.createTable, expected number of records (tested with 20M records and 100M records) (jobId: 2023-06-06_14_34_14-776136986672260899, 2023-06-06_15_02_28-18162755370264063675)
Tested with BigtableIOLT in DataflowTemplate, Beam 2.47.0, record missing (jobId: 2023-06-06_14_52_02-13679425791336453528)
Tested with BigtableIOLT in DataflowTemplate, Beam 2.47.0, table created with BigtableTableAdminClient.createTable, expected number of records (jobId: 2023-06-06_15_32_45-12170662821065212708)

For the job that ended with missing records, running cbt ... -instance <instance> count <table> showed the record count decreasing partway through the write.

Abacn commented 1 year ago

It turns out that this can also affect real-world usage when the Timestamp field is not set, so I am reopening the issue and keeping it as P1.

Abacn commented 1 year ago

Possible solutions:

- When the incoming Timestamp is empty, default it to the current time (instead of the epoch).
- When the incoming Timestamp is empty, raise a loud warning.
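A rough sketch of how the two options could be combined, using only the JDK. This is not BigtableIO code: the class, the method `resolveTimestampMicros`, and the constant are hypothetical, and the sketch assumes that an "empty" timestamp shows up as 0, the default value of an unset proto3 integer field.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.logging.Logger;

class TimestampDefaultingSketch {
    private static final Logger LOG =
        Logger.getLogger(TimestampDefaultingSketch.class.getName());

    // Assumption: an unset timestamp is observed as 0 (the epoch).
    static final long UNSET_TIMESTAMP_MICROS = 0L;

    // Returns the incoming timestamp unchanged when it is set; otherwise logs
    // a warning and falls back to the current time from the supplied clock.
    static long resolveTimestampMicros(long incomingMicros, Clock clock) {
        if (incomingMicros != UNSET_TIMESTAMP_MICROS) {
            return incomingMicros;
        }
        LOG.warning("Cell timestamp is not set; defaulting to current time instead of epoch");
        return clock.millis() * 1000L;
    }

    public static void main(String[] args) {
        // A fixed clock keeps the example deterministic.
        Clock fixed = Clock.fixed(Instant.ofEpochMilli(1_000L), ZoneOffset.UTC);
        System.out.println(resolveTimestampMicros(0L, fixed));      // 1000000
        System.out.println(resolveTimestampMicros(42_000L, fixed)); // 42000
    }
}
```

One trade-off of this approach: a caller who deliberately wants a cell at timestamp 0 cannot be told apart from one who forgot to set it, which is an argument for the warning-only or validation-only variant.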

Abacn commented 1 year ago

CC: @mutianf @ahmedabu98 (this also affects xlang Bigtable)

Abacn commented 1 year ago

Per https://github.com/apache/beam/pull/28624#discussion_r1338601869, we should at least add some validation in the write transform.

kennknowles commented 8 months ago

Would the followup be P2 or still P1?

Abacn commented 8 months ago

This is due to a user bug: an incorrect (epoch) timestamp attached to the cell. The issue is kept open because there is a follow-up (adding a warning) that can still be done, so it is kept as P2 and the issue title has been updated.
