Can you please share the table schema?
Hello @davidrabinowitz, thanks for looking into this. Sorry for being late here. The current schema has many fields, but the column with the issue is updated_date (we are able to read all the other columns). updated_date may have changed over time, but I am not certain of that.

```json
{
  "type": "struct",
  "fields": [
    { "name": "updated_date", "type": "timestamp", "nullable": true, "metadata": {} },
    { "name": "created_timestamp", "type": "string", "nullable": true, "metadata": {} }
  ]
}
```
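For reference, a struct like the one above can be dumped from the Spark side; a minimal sketch, assuming `df` is the DataFrame loaded from the view:

```scala
%scala
// df is assumed to be the DataFrame loaded from the BigQuery view.
// StructType.prettyJson pretty-prints the struct shown above;
// df.schema.json yields the same content on a single line.
println(df.schema.prettyJson)
```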
Thanks, can you please share the schema of the BigQuery table as well (the relevant fields at least)?
I believe the `updatedDate` (TIMESTAMP) column is the only relevant one, as it's the one triggering this problem. It comes from a view that queries some underlying table that we cannot access.
Based on a similar case we had in the past, please check the schema changes of the underlying table.
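One way to check that from the same Spark session is to query BigQuery's INFORMATION_SCHEMA through the connector; a minimal sketch, with the project/dataset/table names as placeholders:

```scala
%scala
// Hedged sketch: list the current column types of the underlying table via
// BigQuery's INFORMATION_SCHEMA.COLUMNS. Project/dataset/table are placeholders.
val columns = spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationDataset", "xx")
  .option("parentProject", "xx")
  .option("query",
    "select column_name, data_type from `xx.xx.INFORMATION_SCHEMA.COLUMNS` where table_name = 'xx'")
  .load()

columns.show(truncate = false)
```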
Here is the schema, as shared with me by the admin of the BigQuery instance we are querying:
@davidrabinowitz - Is there any other information I can provide to make this a useful bug report?
@nchammas What is the connector version that you are using? Please use the latest version 0.36.3 and share the full stack trace of the error.
@nchammas @nandinir-db Please try to use the latest connector version and let us know if you are still facing the issue.
Getting the below error when trying to read a column whose datatype changed.
```
UnsupportedOperationException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 58.0 failed 4 times, most recent failure: Lost task 0.3 in stage 58.0 (TID 67) (10.88.253.240 executor 0): java.lang.UnsupportedOperationException
	at com.google.cloud.spark.bigquery.ArrowSchemaConverter$ArrowVectorAccessor.getLong(ArrowSchemaConverter.java:297)
	at com.google.cloud.spark.bigquery.ArrowSchemaConverter.getLong(ArrowSchemaConverter.java:98)
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.getLong(ColumnarBatchRow.java:120)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
```
1) Read the table with the query option:

```scala
%scala
val df = spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationDataset", "xx")
  .option("parentProject", "xx")
  .option("query", "select xxLastTrialStartDate, updatedDate from xx.xx.xx limit 1")
  .load()
```
```scala
%scala
df.select("xxLastTrialStartDate", "updatedDate").printSchema()
```

```
root
 |-- xxLastTrialStartDate: timestamp (nullable = true)
 |-- updatedDate: string (nullable = true)
```

2) Read the table with the table option:

```scala
%scala
val xxTable = spark.read
  .format("bigquery")
  .option("viewsEnabled", "true")
  .option("parentProject", "xx")
  .option("project", "xx")
  .option("dataset", "xx")
  .option("table", "xx")
  .load()
```
```scala
%scala
xxTable.select("xxLastTrialStartDate", "updatedDate").printSchema()
```

```
root
 |-- xxLastTrialStartDate: timestamp (nullable = true)
 |-- updatedDate: timestamp (nullable = true)
```

Note:

(A) I tried casting to string, but that did not work either, because the query plan still tries to read the underlying data as a timestamp:

```scala
display(xxTable.select($"updatedDate".cast("string")))
```
```
== Analyzed Logical Plan ==
updatedDate: string
```

(B) The issue is not with the timestamp type itself, as we can see a value being returned for xxLastTrialStartDate:

```scala
display(xxTable.select("xxLastTrialStartDate").where("Id=541577"))
```

```
xxLastTrialStartDate
2009-03-27T00:00:00Z
```
display(xxTable.select("updatedDate").where("Id=541577")) UnsupportedOperationException: