apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Cannot read partitions metadata table from bigquery #24726

Open TSienki opened 1 year ago

TSienki commented 1 year ago

What happened?

Hello, I wanted to read partition metadata from the BigQuery table project_id.dataset_id.INFORMATION_SCHEMA.PARTITIONS using ReadFromBigQuery. Unfortunately, the transform raises an error:

RuntimeError: apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://bigquery.googleapis.com/bigquery/v2/projects/[project_id]/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Mon, 19 Dec 2022 17:02:59 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '404', 'content-length': '388', '-content-encoding': 'gzip'}>, content <{
  "error": {
    "code": 404,
    "message": "Not found: Dataset [project_id]:`[table_name] was not found in location EU",
    "errors": [
      {
        "message": "Not found: Dataset [project_id]:`[table_name] was not found in location EU",
        "domain": "global",
        "reason": "notFound"
      }
    ],
    "status": "NOT_FOUND"
  }
}
> [while running '[11]: FindPreviousPartitionDate/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

The part of code that causes the error:

"FindPreviousPartitionDate" >> beam.io.ReadFromBigQuery(
    query="SELECT * FROM `[project_id].[dataset_id].INFORMATION_SCHEMA.PARTITIONS`",
    use_standard_sql=True,
    flatten_results=False
)

I replaced my actual project id and dataset id with the tokens [project_id] and [dataset_id]. I've tested this with Beam versions 2.36.0 and 2.43.0, using both the direct and Dataflow runners. I also tried different values for arguments such as method and use_standard_sql, but that didn't help.

Do you know if it is possible to read from this table using ReadFromBigQuery?

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

ding-qin commented 8 months ago

I used apache_beam 2.53.0 and this bug still exists. Needing the metadata of a BigQuery table in a query is a common use case: not only INFORMATION_SCHEMA but also _partitiondate is required in many cases. For example: select _partitiondate as partition_name, * from xxxx. The same query runs successfully with the BigQuery client library. I want to do this in Apache Beam because I want to convert complex SQL joins into PTransforms so that they can run in parallel and improve performance. For now I have to work around it. Hopefully it can be fixed as soon as possible.
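
For anyone hitting the same error, here is a rough sketch of one possible workaround, based on the observation above that the query works through the BigQuery client library. It is not a fix for ReadFromBigQuery. It assumes the google-cloud-bigquery package is installed, that the metadata result is small enough to fetch while the pipeline is being built, and that the dataset lives in the EU location, as in the original error message. The helper names (partitions_query, fetch_rows, run) are made up for illustration.

```python
from typing import Any, Dict, List


def partitions_query(project_id: str, dataset_id: str) -> str:
    """Build the metadata query from the original report."""
    return (
        "SELECT * FROM "
        f"`{project_id}.{dataset_id}.INFORMATION_SCHEMA.PARTITIONS`"
    )


def fetch_rows(client: Any, sql: str) -> List[Dict[str, Any]]:
    """Run `sql` through a BigQuery client and return rows as plain dicts.

    `client` is expected to behave like google.cloud.bigquery.Client:
    client.query(sql) returns a job whose result() yields Row objects.
    """
    return [dict(row.items()) for row in client.query(sql).result()]


def run(project_id: str, dataset_id: str, location: str = "EU") -> None:
    # Imported lazily so the helpers above do not require these packages.
    import apache_beam as beam
    from google.cloud import bigquery

    # Assumption: the query is executed once, at pipeline construction time.
    client = bigquery.Client(project=project_id, location=location)
    rows = fetch_rows(client, partitions_query(project_id, dataset_id))

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "PartitionMetadata" >> beam.Create(rows)
            | "Print" >> beam.Map(print)
        )
```

Because the rows are fetched on the machine that builds the pipeline, this only suits small lookups such as partition metadata; the query itself is not parallelized by Beam, so it does not cover the large-join case described above.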