Can you please share the table schema?
Hello @davidrabinowitz, thanks for looking into this. Sorry for being late here. The current schema has many fields, but the column with the issue is updated_date (we are able to read all the other columns). updated_date may have changed over time, but I am not certain of that.

```json
{
  "type": "struct",
  "fields": [
    { "name": "updated_date", "type": "timestamp", "nullable": true, "metadata": {} },
    { "name": "created_timestamp", "type": "string", "nullable": true, "metadata": {} }
  ]
}
```
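For reference, a struct like the one above can be dumped from the Spark side; a minimal sketch, assuming `df` is the DataFrame loaded from the view:

```scala
%scala
// df is assumed to be the DataFrame loaded from the BigQuery view.
// StructType.prettyJson pretty-prints the struct shown above;
// df.schema.json yields the same content on a single line.
println(df.schema.prettyJson)
```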
Thanks, can you please share the schema of the BigQuery table as well (the relevant fields at least)?
I believe the `updatedDate` (TIMESTAMP) column is the only relevant one, as it's the one triggering this problem. It comes from a view that queries some underlying table that we cannot access.
Based on a similar case we had in the past, please check the schema changes of the underlying table.
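One way to check that from the same Spark session is to query BigQuery's INFORMATION_SCHEMA through the connector; a minimal sketch, with the project/dataset/table names as placeholders:

```scala
%scala
// Hedged sketch: list the current column types of the underlying table via
// BigQuery's INFORMATION_SCHEMA.COLUMNS. Project/dataset/table are placeholders.
val columns = spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationDataset", "xx")
  .option("parentProject", "xx")
  .option("query",
    "select column_name, data_type from `xx.xx.INFORMATION_SCHEMA.COLUMNS` where table_name = 'xx'")
  .load()

columns.show(truncate = false)
```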
Here is the schema, as shared with me by the admin of the BigQuery instance we are querying:
@davidrabinowitz - Is there any other information I can provide to make this a useful bug report?
@nchammas What is the connector version that you are using? Please use the latest version 0.36.3 and share the full stack trace of the error.
@nchammas @nandinir-db Please try to use the latest connector version and let us know if you are still facing the issue.
Getting the below error when trying to read a column whose datatype changed.
```
UnsupportedOperationException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 58.0 failed 4 times, most recent failure: Lost task 0.3 in stage 58.0 (TID 67) (10.88.253.240 executor 0): java.lang.UnsupportedOperationException
	at com.google.cloud.spark.bigquery.ArrowSchemaConverter$ArrowVectorAccessor.getLong(ArrowSchemaConverter.java:297)
	at com.google.cloud.spark.bigquery.ArrowSchemaConverter.getLong(ArrowSchemaConverter.java:98)
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.getLong(ColumnarBatchRow.java:120)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
```
1) Read the table with the query option:

```scala
%scala
val df = spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationDataset", "xx")
  .option("parentProject", "xx")
  .option("query", "select xxLastTrialStartDate, updatedDate from xx.xx.xx limit 1")
  .load()
```
```scala
%scala
df.select("xxLastTrialStartDate", "updatedDate").printSchema()
```

```
root
 |-- xxLastTrialStartDate: timestamp (nullable = true)
 |-- updatedDate: string (nullable = true)
```

2) Read the table with the table option:

```scala
%scala
val xxTable = spark.read
  .format("bigquery")
  .option("viewsEnabled", "true")
  .option("parentProject", "xx")
  .option("project", "xx")
  .option("dataset", "xx")
  .option("table", "xx")
  .load()
```
```scala
%scala
xxTable.select("xxLastTrialStartDate", "updatedDate").printSchema()
```

```
root
 |-- xxLastTrialStartDate: timestamp (nullable = true)
 |-- updatedDate: timestamp (nullable = true)
```

Note:

(A) I tried casting to string, but that did not work either, because the query plan still tries to read the underlying data as a timestamp:

```scala
display(xxTable.select($"updatedDate".cast("string")))
```
```
== Analyzed Logical Plan ==
updatedDate: string
```

(B) The issue is not with the timestamp type itself, as we can see a value being returned for xxLastTrialStartDate:

```scala
display(xxTable.select("xxLastTrialStartDate").where("Id=541577"))
```

```
xxLastTrialStartDate
2009-03-27T00:00:00Z
```
display(xxTable.select("updatedDate").where("Id=541577")) UnsupportedOperationException: