apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Possible data loss in BigtableIO r/w if timestamp not set (default to epoch) #27022

Open Abacn opened 1 year ago

Abacn commented 1 year ago

What happened?

Reported from https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759

When implementing a load test for BigTableIO, we encountered the following:

Dataflow write pipeline logs say that 10M records were written. However, the read job shows only 1.6M records read.

Using the cbt utility, the `cbt -instance <instance> count <table>` command showed that the BigTableIO write did not work correctly. Although the logs say that all 10M records were written, the table contained exactly as many records as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never made it into the table.

project: apache-beam-testing

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

Abacn commented 1 year ago

This is not related to Beam; the test utility resource manager in DataflowTemplates has a wrong setting.

The cause was found there: https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/759#discussion_r1220523303

The cause is:
- the cell does not have a timestamp set, so it defaults to the epoch (1970-01-01)
- the table created by createTable has a garbage collection policy of 1h, so writing a large amount of data triggers GC and some records get deleted
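To make the interaction between the two points above concrete, here is a minimal sketch of an age-based GC check. It is a simplified model, not Bigtable's actual implementation; the class and method names are hypothetical and only the standard library is used.

```java
// Simplified model of a max-age garbage-collection rule (hypothetical names,
// not the Bigtable implementation).
final class MaxAgeGcSketch {
    static final long MICROS_PER_HOUR = 3_600L * 1_000_000L;

    /** True when the cell timestamp is older than maxAgeMicros relative to nowMicros. */
    static boolean isEligibleForGc(long cellTimestampMicros, long nowMicros, long maxAgeMicros) {
        return nowMicros - cellTimestampMicros > maxAgeMicros;
    }

    public static void main(String[] args) {
        long nowMicros = java.time.Instant.now().toEpochMilli() * 1000L;
        long unsetTimestampMicros = 0L; // an unset timestamp is read as the epoch

        // A cell stamped at the epoch is decades older than a 1h max age.
        System.out.println("epoch cell eligible: "
            + isEligibleForGc(unsetTimestampMicros, nowMicros, MICROS_PER_HOUR));
        // A cell stamped with the current time is not.
        System.out.println("fresh cell eligible: "
            + isEligibleForGc(nowMicros, nowMicros, MICROS_PER_HOUR));
    }
}
```

Under this model every cell written without a timestamp is already past the 1h limit at the moment it is written, which is consistent with the record count dropping while the write job was still running.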

We need to call `.setTimestampMicros(java.time.Instant.now().toEpochMilli()*1000)` on Mutation.SetCell.
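A small helper for computing that value might look like the sketch below. The class and method names are hypothetical; the only part taken from this thread is the millis-to-micros conversion. It keeps millisecond precision, so the result is always a multiple of 1000, and it uses `Math.multiplyExact` so an overflow would fail loudly rather than wrap.

```java
// Hypothetical helper; uses only the standard library.
final class CellTimestampSketch {
    /** Converts an Instant to microseconds since the epoch, at millisecond precision. */
    static long toMicros(java.time.Instant instant) {
        return Math.multiplyExact(instant.toEpochMilli(), 1000L);
    }

    /** Current wall-clock time in microseconds since the epoch. */
    static long nowMicros() {
        return toMicros(java.time.Instant.now());
    }

    public static void main(String[] args) {
        // With the Bigtable client on the classpath, this is the value that
        // would be passed to setTimestampMicros(...) on Mutation.SetCell.
        System.out.println("timestampMicros = " + nowMicros());
    }
}
```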

=========== (obsolete)

  • Tested with Beam 2.47.0, 2.48.0, table created with BigtableTableAdminClient.createTable, expected number of records (tested with 20M records and 100M records) (jobId: 2023-06-06_14_34_14-776136986672260899, 2023-06-06_15_02_28-18162755370264063675)
  • Tested with BigtableIOLT in DataflowTemplate, Beam 2.47.0, record missing (jobId: 2023-06-06_14_52_02-13679425791336453528)
  • Tested with BigtableIOLT in DataflowTemplate, Beam 2.47.0, table created with BigtableTableAdminClient.createTable, expected number of records (jobId: 2023-06-06_15_32_45-12170662821065212708)

For the job that ended up with missing records, cbt ... -instance <instance> count <table> showed that the number of records decreased halfway through writing.

Abacn commented 1 year ago

It turns out that this could also affect real-world usage when the Timestamp field is not set. Reopening it and keeping it as P1.

Abacn commented 1 year ago

Possible solutions:

  • when incoming Timestamp is empty, default set to current time (instead of epoch)
  • when incoming Timestamp is empty, raise a load warning
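The two options above could be sketched roughly as follows. This is not BigtableIO code: the names are hypothetical, and it assumes an unset timestamp shows up as 0, which means a cell deliberately stamped at the epoch cannot be told apart from one that was left unset.

```java
// Hypothetical sketch of the two proposed behaviours; not part of BigtableIO.
final class TimestampDefaultingSketch {
    // Assumption: an unset timestamp field is observed as 0 (the epoch).
    static final long UNSET_TIMESTAMP_MICROS = 0L;

    /** Option 1: substitute the current time when the incoming timestamp is unset. */
    static long defaultToNow(long timestampMicros, long nowMicros) {
        return timestampMicros == UNSET_TIMESTAMP_MICROS ? nowMicros : timestampMicros;
    }

    /** Option 2: leave the value alone and report whether a warning is due. */
    static boolean shouldWarn(long timestampMicros) {
        return timestampMicros == UNSET_TIMESTAMP_MICROS;
    }

    public static void main(String[] args) {
        long nowMicros = java.time.Instant.now().toEpochMilli() * 1000L;
        long incoming = 0L; // simulates a mutation whose timestamp was never set

        if (shouldWarn(incoming)) {
            System.err.println("Cell timestamp is unset (epoch); it may be garbage collected.");
        }
        System.out.println("resolved timestampMicros = " + defaultToNow(incoming, nowMicros));
    }
}
```

Option 1 changes what gets written, so it would alter behaviour for anyone who relies on epoch timestamps; option 2 leaves the data untouched and only surfaces the risk.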

Abacn commented 1 year ago

CC: @mutianf @ahmedabu98 (this also affects xlang Bigtable)

Abacn commented 1 year ago

Per https://github.com/apache/beam/pull/28624#discussion_r1338601869, we should at least add some validation in the write transform.

kennknowles commented 8 months ago

Would the followup be P2 or still P1?

Abacn commented 8 months ago

This is due to a user bug: an incorrect (epoch) timestamp attached to the cell. The issue is kept open because a follow-up (adding a warning) can still be done, so I am keeping it at P2 and updating the issue title.