GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0
377 stars 197 forks source link

Cannot query views with FILTER and SELECT operators, throws invalid_argument exception #354

Closed sandeep-katta0102 closed 3 years ago

sandeep-katta0102 commented 3 years ago

When I run viewdf.filter.select it throws INVALID_ARGUMENT: request failed: Row filter for table xxxx:tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 is invalid. Filter is '(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)'

Code


  val viewName = "XXXXX.tempdataset.covidconfirmview"
  val projectId = "XXXX"
  val viewDf = spark.read.format("bigquery").option("project", projectId).option("parentProject", projectId)
    .option("viewsEnabled", "true").option("table",viewName).load()

  viewDf.filter($"date" === "2021-03-29 17:00:00").show() // this works

  viewDf.filter($"date" === "2021-03-29 17:00:00").select("country").show() // throws Exception

I have created the view using the public dataset bigquery-public-data.covid19_italy.data_by_province

Saved the output of below results as view in BigQuery

SELECT
   *
FROM
  `bigquery-public-data.covid19_italy.data_by_province` covid19
WHERE
  EXTRACT(date from DATE) = DATE_SUB(CURRENT_DATE(),INTERVAL 1 day) 
ORDER BY
  confirmed_cases desc

Logs

21/03/30 10:10:32 INFO DirectBigQueryRelation: Querying table XXXXX.tempdataset.covidconfirmview, parameters sent from Spark: requiredColumns=[country], filters=[isnotnull(date#515),(date#515 = 1617037200000000)] 21/03/30 10:10:34 INFO DirectBigQueryRelation: Going to read from XXXXX.tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 columns=[country], filter='(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)' 21/03/30 10:10:34 INFO DirectBigQueryRelation: unhandledFilters: 21/03/30 10:10:34 INFO DirectBigQueryRelation: Querying table XXXXX.tempdataset.covidconfirmview, parameters sent from Spark: requiredColumns=[country], filters=[isnotnull(date#515),(date#515 = 1617037200000000)] 21/03/30 10:10:34 INFO DirectBigQueryRelation: Going to read from XXXXX.tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 columns=[country], filter='(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)' 21/03/30 10:10:34 WARN SQLExecution: Error executing delta metering com.google.api.gax.rpc.InvalidArgumentException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table XXXXX:tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 is invalid. Filter is '(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)'

sandeep-katta0102 commented 3 years ago

@davidrabinowitz could you please look into this, it is becoming seemingly difficult to query views using BigQuery Connector

himanshukohli09 commented 3 years ago

Hi Sandeep,

Please try the following query:

viewDf.filter($"date" === "2021-03-29 17:00:00").select("country", "date").show()

sandeep-katta0102 commented 3 years ago

@himanshukohli09

Thanks for the workaround, is it a bug or limitation ?

himanshukohli09 commented 3 years ago

This is currently the limitation. In case of views the fields in the filter should also be part of select.

sandeep-katta0102 commented 3 years ago

Could we document this limitation, I don't find any referrence which says as a limitation

https://github.com/GoogleCloudDataproc/spark-bigquery-connector#reading-data-from-a-bigquery-query

himanshukohli09 commented 3 years ago

Hi Sandeep,

Sure, I will add this limitation in the document(readme). Also, the reason for this limitation is as follows:

As mentioned here, the views have to be materialized before reading them. The materialized view is created with all the columns but if there is a select operator present then the materialized view is created with these columns only. Now, in this particular case the materialized view is created with a single column "country". As a result, when after materialization the filter operator is applied on column "date" it throws error saying "INVALID_ARGUMENT" as "date" column in not present in the materialized view.

Hence, in case of views if we want to use both select and filter then the columns on which filter is applied those columns should also be present in select.

himanshukohli09 commented 3 years ago

Hi Sandeep,

I have corrected this limitation and the PR has been merged.