GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Issue #1290: Stopped using metadata for optimized count path #1293

Closed: vishalkarve15 closed this 1 month ago

vishalkarve15 commented 2 months ago

TableInfo.getNumRows() only covers data that has been materialized to data files; it does not include data in the streaming buffer. Data in the streaming buffer (i.e., ingested through the Write API but not yet converted to data files) is readable through the Read API, but TableInfo.getNumRows() does not reflect it. Running a SELECT COUNT(1) query works, because it triggers an actual data scan through the Read API.

This issue only affects direct write.
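For illustration, here is a minimal sketch (dataset and table names are hypothetical) contrasting the two count paths with the BigQuery Java client: the metadata-based getNumRows() call, which misses rows still in the streaming buffer, versus a SELECT COUNT(1) query, which scans the table and sees them.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class CountPaths {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Hypothetical dataset/table names, for illustration only.
    TableId tableId = TableId.of("my_dataset", "my_table");

    // Metadata path: fast, but reflects only rows materialized to data files,
    // so rows still in the streaming buffer are not counted.
    Table table = bigquery.getTable(tableId);
    System.out.println("numRows from metadata: " + table.getNumRows());

    // Query path: triggers an actual scan, so streaming-buffer rows are included.
    QueryJobConfiguration query =
        QueryJobConfiguration.newBuilder("SELECT COUNT(1) FROM `my_dataset.my_table`")
            .build();
    long counted =
        bigquery.query(query).getValues().iterator().next().get(0).getLongValue();
    System.out.println("count(1) from query: " + counted);
  }
}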

vishalkarve15 commented 2 months ago

/gcbrun

vishalkarve15 commented 2 months ago

/gcbrun

vishalkarve15 commented 2 months ago

/gcbrun

vishalkarve15 commented 2 months ago

We have 2 options here:

I'd prefer the 2nd option, even though it is a breaking change, because it matches the intended behavior of require partition filter.

davidrabinowitz commented 2 months ago

Or option 3: if requirePartitionFilter is true, run SELECT COUNT(*) FROM <table> WHERE <partition_column> BETWEEN <min-val> AND <max-val>, where the min/max values depend on the partitioning type: MIN_LONG/MAX_LONG for range partitions, 0001-01-01 / 9999-12-31 for date partitions. Perhaps <partition_column> IS NOT NULL would work as well.
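A rough sketch of how such a query could be assembled (the class, enum, and method names are hypothetical, not part of the connector), assuming the bounds are chosen per partitioning type as described above:

// Hypothetical sketch of option 3: wrap COUNT(*) in an all-inclusive partition
// filter so that requirePartitionFilter is satisfied while still counting everything.
public class PartitionedCountQueryBuilder {

  enum PartitionType { RANGE, DATE }

  static String buildCountQuery(String table, String partitionColumn, PartitionType type) {
    String lower;
    String upper;
    if (type == PartitionType.RANGE) {
      // Integer-range partitioning: cover the whole long range.
      lower = String.valueOf(Long.MIN_VALUE);
      upper = String.valueOf(Long.MAX_VALUE);
    } else {
      // Date partitioning: cover all representable dates.
      lower = "DATE '0001-01-01'";
      upper = "DATE '9999-12-31'";
    }
    return String.format(
        "SELECT COUNT(*) FROM `%s` WHERE `%s` BETWEEN %s AND %s",
        table, partitionColumn, lower, upper);
  }
}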

vishalkarve15 commented 1 month ago

/gcbrun

vishalkarve15 commented 1 month ago

@davidrabinowitz wouldn't option 2 make more sense? Consider a DataFrame that queries a table with requirePartitionFilter = true.

Dataset<Row> readDF = spark.read().format("bigquery").load(table);
readDF.count(); // succeeds but ends up running count(*) on the entire table
readDF.collect(); // fails due to require partition filter

I think the purpose of require partition filter is to let users avoid running expensive queries against the whole table. The count() path above seems to enable exactly the behavior the filter is meant to prevent, since we sidestep the requirement at the query level.

The existing behavior for BigLake tables is to error out. By having BigQuery native tables error out too, we would make the behavior consistent.
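For comparison, a user who wants count() to work today can supply the partition filter explicitly. This is only a sketch continuing the snippet above, assuming the connector pushes the where() predicate down to BigQuery and that partition_date is a hypothetical partition column:

Dataset<Row> readDF = spark.read().format("bigquery").load(table);
Dataset<Row> filtered = readDF.where("partition_date = DATE '2024-01-01'"); // hypothetical partition column
filtered.count();   // the pushed-down filter satisfies requirePartitionFilter
filtered.collect(); // succeeds for the same reason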

vishalkarve15 commented 1 month ago

/gcbrun

vishalkarve15 commented 1 month ago

/gcbrun

vishalkarve15 commented 1 month ago

As discussed offline, we'll go ahead with option 3, since we can't introduce a breaking change.

vishalkarve15 commented 1 month ago

/gcbrun