GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0
378 stars 198 forks source link

BigNumeric precision is too wide (76), Spark can only handle decimal types with max precision of 38 #1234

Closed John15321 closed 5 months ago

John15321 commented 5 months ago

Hi, Im having a problem with Spark and Big Query, the table that I am trying to query using the connector contains BIGNUMERIC columns (mind that this happens no matter if I do a query for the specific column that is BIGNUMERIC or not, simply its existence in the schema of the whole table causes the issue to exist).

24/05/27 15:18:38 INFO SparkBigQueryConnectorModule: Registering cleanup jobs listener, should happen just once
Traceback (most recent call last):
  File "/Users/jbronicki/Work/makingnewrepo/myproject/src/myproj/myproj.py", line 31, in <module>
    .load()
     ^^^^^^
  File "/Users/jbronicki/Work/makingnewrepo/myproject/venv/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/Users/jbronicki/Work/makingnewrepo/myproject/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/Users/jbronicki/Work/makingnewrepo/myproject/venv/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 185, in deco
pyspark.errors.exceptions.captured.IllegalArgumentException: BigNumeric precision is too wide (76), Spark can only handle decimal types with max precision of 38

Here is the code:


from pyspark.sql import SparkSession

# Initialize Spark session
spark = (
    SparkSession.builder.appName("PySpark BigQuery Example")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-3.5-bigquery:0.39.0",
    )
    .getOrCreate()
)

# Define your project, dataset, and table details
project_id = "some-project"
dataset_id = "mydataset"
table_id = "mytable"

# Read data from BigQuery using the query
bq_data = (
    spark.read.format("bigquery")
    .option("project", project_id)
    .option("dataset", dataset_id)
    .option("table", table_id)
    .option("parentProject", project_id)
    .load()
)

# Create a temporary view for the DataFrame
bq_data.createOrReplaceTempView("bq_table")

# Perform some data processing using SQL query
processed_data = spark.sql(
    """
SELECT
  AccountID,
  Symbol,
  Direction,
FROM
  bq_table
"""
)

# Show processed data
processed_data.show()

# Stop the Spark session
spark.stop()

I know that this issue has been reported earlier but it is closed as of now, thus I decided to create a new one ( https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/1208 ).

BIGNUMERIC | NULLABLE
davidrabinowitz commented 5 months ago

As mentioned in the error, the BIGNUMERIC data type is wider than Spark's Decimal. It means that even the table load cannot be performed as we cannot convert the BigQuery schema to the Spark one, regardless of the read fields.

As a workaround, you can do one of the following:

  1. Run a SQL on BigQuery: bq_data = spark.read.format("bigquery").load("SELECT AccountID, Symbol, Direction, FROM project_id.dataset_id.bq_table")
  2. Changing the BIGNUMERIC's field precision, to be at most 38.
John15321 commented 5 months ago

As mentioned in the error, the BIGNUMERIC data type is wider than Spark's Decimal. It means that even the table load cannot be performed as we cannot convert the BigQuery schema to the Spark one, regardless of the read fields.

As a workaround, you can do one of the following:

  1. Run a SQL on BigQuery: bq_data = spark.read.format("bigquery").load("SELECT AccountID, Symbol, Direction, FROM project_id.dataset_id.bq_table")
  2. Changing the BIGNUMERIC's field precision, to be at most 38.

Hmm, @davidrabinowitz Im having some problems with the first solution. Whenever I try to do that, I get an error saying that the dataset cannot be found:

from pyspark.sql import SparkSession

spark: SparkSession = (
    SparkSession.builder.appName("PySpark BigQuery Example")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-3.5-bigquery:0.39.0",
    )
    .getOrCreate()
)
spark.conf.set("viewsEnabled", "true")

project_id = "myproject"
dataset_id = "mydataset"
table_id = "mytable"

bq_data = (
    spark.read.format("bigquery")
    .load(
        f"SELECT AccountID, Symbol, Direction, FROM {project_id}.{dataset_id}.{table_id} WHERE partition_date = '2024-05-18'"
    )
)

bq_data.show()
spark.stop()

Error:

24/05/29 11:45:53 INFO SparkBigQueryConnectorModule: Registering cleanup jobs listener, should happen just once
Traceback (most recent call last):
  File "/Users/jbronicki/Work/my-project/src/myproj/myproj.py", line 21, in <module>
    .load(
     ^^^^^
  File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 307, in load
  File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: 'dataset' not parsed or provided.
  at SparkBigQueryConnectorModule.configure(SparkBigQueryConnectorModule.java:70)
  at BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:106)
      \_ for 1st parameter
  at BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:106)
  while locating BigQueryClient

Learn more:
  https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER

1 error

======================
Full classname legend:
======================
BigQueryClient:               "com.google.cloud.bigquery.connector.common.BigQueryClient"
BigQueryClientModule:         "com.google.cloud.bigquery.connector.common.BigQueryClientModule"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
========================
End of classname legend:
========================

        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:251)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1104)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1139)
        at com.google.cloud.spark.bigquery.v2.Spark3Util.createBigQueryTableInstance(Spark3Util.java:48)
        at com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider.getBigQueryTableInternal(Spark35BigQueryTableProvider.java:33)
        at com.google.cloud.spark.bigquery.v2.Spark31BigQueryTableProvider.inferSchema(Spark31BigQueryTableProvider.java:40)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:90)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:140)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:210)
        at scala.Option.flatMap(Option.scala:271)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.IllegalArgumentException: 'dataset' not parsed or provided.
        at com.google.cloud.bigquery.connector.common.BigQueryUtil.lambda$parseTableId$2(BigQueryUtil.java:184)
        at java.base/java.util.Optional.orElseThrow(Optional.java:403)
        at com.google.cloud.bigquery.connector.common.BigQueryUtil.parseTableId(BigQueryUtil.java:184)
        at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:329)
        at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:263)
        at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.lambda$provideSparkBigQueryConfig$0(SparkBigQueryConnectorModule.java:104)
        at java.base/java.util.Optional.orElseGet(Optional.java:364)
        at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderInternalFactory.provision(ProviderInternalFactory.java:86)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToInitializableAdapter.provision(InternalFactoryToInitializableAdapter.java:57)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderInternalFactory.circularGet(ProviderInternalFactory.java:60)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToInitializableAdapter.get(InternalFactoryToInitializableAdapter.java:47)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:40)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:60)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:171)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:169)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1101)
        ... 20 more

Even tho when I test this exact SQL it's working. Am I doing something wrong here or is the second option that you mentioned the only possible fix for this situation?

davidrabinowitz commented 5 months ago

Please add spark.conf.set("materializationDataset", "<SOME-DATASET>")

John15321 commented 5 months ago

I see, that would work. So If we want to keep using BigNumeric 76 we would have to create a materialization table in every location where we wanna query and a BIGNUMERIC 76 exists?

davidrabinowitz commented 5 months ago

Yes - as long as you do not query the BigNumeric itself.

Any reason to use such a wide decimal? I'd be happy to hear about your use case. You can also reach directly.

John15321 commented 5 months ago

Well...at some point someone just set it as such and here we are trying to use it 🤷‍♂️🤣