GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Issue #1290: Stopped using metadata for optimized count path #1293

Closed: vishalkarve15 closed this 1 month ago

vishalkarve15 commented 2 months ago

TableInfo.getNumRows() only covers data that has been materialized to data files; it does not include data in the streaming buffer. Data in the streaming buffer (i.e., ingested through the Write API but not yet converted to data files) is readable through the Read API, but TableInfo.getNumRows() does not reflect it. Running a SELECT COUNT(1) query works, because it triggers an actual data scan through the Read API.

This issue only affects direct write.
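For illustration, here is a minimal sketch (dataset and table names are hypothetical) contrasting the two count paths with the BigQuery Java client: the metadata-based getNumRows() call, which misses rows still in the streaming buffer, versus a SELECT COUNT(1) query, which scans the table and sees them.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class CountPaths {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Hypothetical dataset/table names, for illustration only.
    TableId tableId = TableId.of("my_dataset", "my_table");

    // Metadata path: fast, but reflects only rows materialized to data files,
    // so rows still in the streaming buffer are not counted.
    Table table = bigquery.getTable(tableId);
    System.out.println("numRows from metadata: " + table.getNumRows());

    // Query path: triggers an actual scan, so streaming-buffer rows are included.
    QueryJobConfiguration query =
        QueryJobConfiguration.newBuilder("SELECT COUNT(1) FROM `my_dataset.my_table`")
            .build();
    long counted =
        bigquery.query(query).getValues().iterator().next().get(0).getLongValue();
    System.out.println("count(1) from query: " + counted);
  }
}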

vishalkarve15 commented 2 months ago

/gcbrun

vishalkarve15 commented 2 months ago

/gcbrun

vishalkarve15 commented 2 months ago

/gcbrun

vishalkarve15 commented 2 months ago

We have 2 options here:

I'd prefer the 2nd option, even though it is a breaking change, because it matches the intended behavior of require partition filter.

davidrabinowitz commented 2 months ago

Or option 3: if requirePartitionFilter is true, run SELECT COUNT(*) FROM <table> WHERE <partition_column> BETWEEN <min-val> AND <max-val>, where the min/max values depend on the partitioning type: MIN_LONG/MAX_LONG for range partitions, 0001-01-01 / 9999-12-31 for date partitions. Perhaps <partition_column> IS NOT NULL would work as well.
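A rough sketch of how such a query could be assembled (the class, enum, and method names are hypothetical, not part of the connector), assuming the bounds are chosen per partitioning type as described above:

// Hypothetical sketch of option 3: wrap COUNT(*) in an all-inclusive partition
// filter so that requirePartitionFilter is satisfied while still counting everything.
public class PartitionedCountQueryBuilder {

  enum PartitionType { RANGE, DATE }

  static String buildCountQuery(String table, String partitionColumn, PartitionType type) {
    String lower;
    String upper;
    if (type == PartitionType.RANGE) {
      // Integer-range partitioning: cover the whole long range.
      lower = String.valueOf(Long.MIN_VALUE);
      upper = String.valueOf(Long.MAX_VALUE);
    } else {
      // Date partitioning: cover all representable dates.
      lower = "DATE '0001-01-01'";
      upper = "DATE '9999-12-31'";
    }
    return String.format(
        "SELECT COUNT(*) FROM `%s` WHERE `%s` BETWEEN %s AND %s",
        table, partitionColumn, lower, upper);
  }
}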

vishalkarve15 commented 1 month ago

/gcbrun

vishalkarve15 commented 1 month ago

@davidrabinowitz wouldn't option 2 make more sense? Consider a DataFrame that queries a table with requirePartitionFilter = true.

Dataset<Row> readDF = spark.read().format("bigquery").load(table);
readDF.count(); // succeeds but ends up running count(*) on the entire table
readDF.collect(); // fails due to require partition filter

I think the purpose of require partition filter is to let users avoid running expensive queries against the whole table. The count() path above seems to enable exactly the behavior the filter is meant to prevent, since we sidestep the requirement at the query level.

The existing behavior for BigLake tables is to error out. By having BigQuery native tables error out too, we would make the behavior consistent.
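For comparison, a user who wants count() to work today can supply the partition filter explicitly. This is only a sketch continuing the snippet above, assuming the connector pushes the where() predicate down to BigQuery and that partition_date is a hypothetical partition column:

Dataset<Row> readDF = spark.read().format("bigquery").load(table);
Dataset<Row> filtered = readDF.where("partition_date = DATE '2024-01-01'"); // hypothetical partition column
filtered.count();   // the pushed-down filter satisfies requirePartitionFilter
filtered.collect(); // succeeds for the same reason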

vishalkarve15 commented 1 month ago

/gcbrun

vishalkarve15 commented 1 month ago

/gcbrun

vishalkarve15 commented 1 month ago

As discussed offline, we'll go ahead with option 3, since we can't introduce a breaking change.

vishalkarve15 commented 1 month ago

/gcbrun