Closed: vishalkarve15 closed this 1 month ago
/gcbrun
/gcbrun
/gcbrun
We have 2 options here:
I'd prefer the second option, which makes this a breaking change, because it matches the intended behavior of requirePartitionFilter.
Or option 3: if requirePartitionFilter is true, run SELECT COUNT(*) FROM <table> WHERE <partition_column> BETWEEN <min-val> AND <max-val>, where the min/max values are based on the partitioning type: MIN_LONG/MAX_LONG for range partitions, 0001-01-01 / 9999-12-31 for date partitions. Perhaps <partition_column> IS NOT NULL would work as well.
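To illustrate, a rough sketch of how such a count query could be assembled (a hypothetical helper, not the connector's actual code; the method name and the boundary handling are assumptions):

// Wrap COUNT(*) in a predicate covering the partition column's full domain, so the
// query satisfies requirePartitionFilter while still counting every row.
static String buildCountQuery(String table, String partitionColumn, boolean isRangePartitioned) {
  String fullRangePredicate = isRangePartitioned
      // Range partitions: cover the whole INT64 domain.
      ? String.format("%s BETWEEN %d AND %d", partitionColumn, Long.MIN_VALUE, Long.MAX_VALUE)
      // Date partitions: cover the whole DATE domain.
      : String.format("%s BETWEEN '0001-01-01' AND '9999-12-31'", partitionColumn);
  return String.format("SELECT COUNT(*) FROM `%s` WHERE %s", table, fullRangePredicate);
}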
/gcbrun
@davidrabinowitz wouldn't option 2 make more sense? Consider a DataFrame reading a table with requirePartitionFilter = true:
Dataset<Row> readDF = spark.read().format("bigquery").load(table);
readDF.count(); // succeeds but ends up running count(*) on the entire table
readDF.collect(); // fails due to require partition filter
I think the purpose of requirePartitionFilter is to let users avoid running expensive queries against the whole table. Option 3 seems to enable exactly that kind of query, since we're only working around the filter at the query level.
The existing behavior for BigLake tables is to error out. By having BigQuery native tables error out too, we make the behavior consistent.
/gcbrun
/gcbrun
As discussed offline, we'll go ahead with option 3, since we can't introduce a breaking change.
/gcbrun
TableInfo.getNumRows() only covers data that has been materialized to data files; it does not include data in the streaming buffer. Data in the streaming buffer (i.e., ingested through the Write API but not yet converted to data files) is readable via the Read API, but TableInfo.getNumRows() won't reflect any changes still sitting in the streaming buffer. Running a SELECT COUNT(1) query works, since it triggers an actual data scan through the Read API.
This issue only affects direct write.
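For reference, a minimal sketch using the BigQuery Java client that contrasts the two counts described above (placeholder dataset/table names, not connector code):

import java.math.BigInteger;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;

static void compareCounts() throws InterruptedException {
  BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  TableId tableId = TableId.of("my_dataset", "my_table");

  // Metadata-based count: only rows already materialized to data files;
  // rows still in the streaming buffer are not reflected here.
  BigInteger metadataRows = bigquery.getTable(tableId).getNumRows();

  // Query-based count: runs an actual scan, so rows ingested through the
  // Write API but not yet flushed out of the streaming buffer are included.
  TableResult result = bigquery.query(
      QueryJobConfiguration.of("SELECT COUNT(1) FROM `my_dataset.my_table`"));
  long scannedRows = result.iterateAll().iterator().next().get(0).getLongValue();

  System.out.printf("metadata=%s, scan=%d%n", metadataRows, scannedRows);
}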