GoogleCloudDataproc / flink-bigquery-connector

BigQuery integration to Apache Flink's Table API
Apache License 2.0
15 stars 11 forks source link

_PARTITIONTIME do not appear on TableSchema, causing an Exception on `streamAvros` #133

Closed antonioc-ps closed 1 month ago

antonioc-ps commented 3 months ago

BigQuery has a concept of _PARTITIONTIME for Partitioned Table that describes a pseudo column that contains the UTC day that the partitioned data was loaded.

When the method BigQuerySource.streamAvros is used, internally it runs the following:

        BigQueryConnectOptions connectOptions = readOptions.getBigQueryConnectOptions();
        TableSchema tableSchema =
                BigQueryServicesFactory.instance(connectOptions)
                        .queryClient()
                        .getTableSchema(
                                connectOptions.getProjectId(),
                                connectOptions.getDataset(),
                                connectOptions.getTable());

Unfortunately the tableSchema that is returned doesn't contain this pseudocolumn _PARTITIONTIME.

The streamAvro method follows the below chain of methods calls UnboundedSplitAssigner.discoverNewSplits -> BigQueryServicesImpl.retrievePartitionsStatus -> BigQueryServicesImpl.retrievePartitionColumnInfo -> BigQueryPartitionUtils.retrievePartitionColumnType

this last method throw an exception because _PARTITIONTIME column doesn't appear in the

Table tableInfo = BigQueryUtils.tableInfo(bigquery, project, dataset, table).getSchema()

jayehwhyehentee commented 1 month ago

Thanks for pointing out @antonioc-ps. We have observed several issues with the current unbounded source implementation and will re-write it completely in the future. As a precautionary measure, the connector's unbounded source feature will be withdrawn soon since the current offering is simply incorrect.

jayehwhyehentee commented 1 month ago

Closing for now.