Closed John15321 closed 5 months ago
As mentioned in the error, the BIGNUMERIC data type is wider than Spark's Decimal (precision up to 76 digits vs. Spark's maximum of 38). This means that even the table load cannot be performed, as we cannot convert the BigQuery schema to the Spark one, regardless of the fields being read.
As a workaround, you can do one of the following:
- Run a SQL query on BigQuery:
bq_data = spark.read.format("bigquery").load("SELECT AccountID, Symbol, Direction FROM project_id.dataset_id.bq_table")
- Change the BIGNUMERIC field's precision to be at most 38 (or cast the column down inside the query, as sketched below).
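A minimal sketch combining both ideas, assuming a hypothetical BIGNUMERIC column named Amount: casting it to NUMERIC (precision 38, scale 9) inside the pushed-down query keeps the result within Spark's DecimalType range. Note that query-based reads also require the viewsEnabled and materializationDataset options discussed further down.
# Hypothetical sketch: 'Amount' stands in for a BIGNUMERIC column.
# CAST(... AS NUMERIC) yields DECIMAL(38, 9), which Spark's Decimal can represent.
bq_data = spark.read.format("bigquery").load(
    "SELECT AccountID, Symbol, Direction, CAST(Amount AS NUMERIC) AS Amount"
    " FROM project_id.dataset_id.bq_table"
)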
Hmm, @davidrabinowitz I'm having some problems with the first solution. Whenever I try to do that, I get an error saying that the dataset cannot be found:
from pyspark.sql import SparkSession

spark: SparkSession = (
    SparkSession.builder.appName("PySpark BigQuery Example")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-3.5-bigquery:0.39.0",
    )
    .getOrCreate()
)

# Allow reading the results of a SQL query / view.
spark.conf.set("viewsEnabled", "true")

project_id = "myproject"
dataset_id = "mydataset"
table_id = "mytable"

bq_data = (
    spark.read.format("bigquery")
    .load(
        f"SELECT AccountID, Symbol, Direction FROM {project_id}.{dataset_id}.{table_id} WHERE partition_date = '2024-05-18'"
    )
)

bq_data.show()
spark.stop()
Error:
24/05/29 11:45:53 INFO SparkBigQueryConnectorModule: Registering cleanup jobs listener, should happen just once
Traceback (most recent call last):
File "/Users/jbronicki/Work/my-project/src/myproj/myproj.py", line 21, in <module>
.load(
^^^^^
File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 307, in load
File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/Users/jbronicki/Work/my-project/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: 'dataset' not parsed or provided.
at SparkBigQueryConnectorModule.configure(SparkBigQueryConnectorModule.java:70)
at BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:106)
\_ for 1st parameter
at BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:106)
while locating BigQueryClient
Learn more:
https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER
1 error
======================
Full classname legend:
======================
BigQueryClient: "com.google.cloud.bigquery.connector.common.BigQueryClient"
BigQueryClientModule: "com.google.cloud.bigquery.connector.common.BigQueryClientModule"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
========================
End of classname legend:
========================
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:251)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1104)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1139)
at com.google.cloud.spark.bigquery.v2.Spark3Util.createBigQueryTableInstance(Spark3Util.java:48)
at com.google.cloud.spark.bigquery.v2.Spark35BigQueryTableProvider.getBigQueryTableInternal(Spark35BigQueryTableProvider.java:33)
at com.google.cloud.spark.bigquery.v2.Spark31BigQueryTableProvider.inferSchema(Spark31BigQueryTableProvider.java:40)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:90)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:140)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:210)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: java.lang.IllegalArgumentException: 'dataset' not parsed or provided.
at com.google.cloud.bigquery.connector.common.BigQueryUtil.lambda$parseTableId$2(BigQueryUtil.java:184)
at java.base/java.util.Optional.orElseThrow(Optional.java:403)
at com.google.cloud.bigquery.connector.common.BigQueryUtil.parseTableId(BigQueryUtil.java:184)
at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:329)
at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:263)
at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.lambda$provideSparkBigQueryConfig$0(SparkBigQueryConnectorModule.java:104)
at java.base/java.util.Optional.orElseGet(Optional.java:364)
at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderInternalFactory.provision(ProviderInternalFactory.java:86)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToInitializableAdapter.provision(InternalFactoryToInitializableAdapter.java:57)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderInternalFactory.circularGet(ProviderInternalFactory.java:60)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToInitializableAdapter.get(InternalFactoryToInitializableAdapter.java:47)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:40)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:60)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:171)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1101)
... 20 more
The odd thing is that this exact SQL works when I test it directly in BigQuery. Am I doing something wrong here, or is the second option you mentioned the only possible fix for this situation?
Please add spark.conf.set("materializationDataset", "<SOME-DATASET>")
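For completeness, a minimal sketch of the resulting configuration; the dataset name is a placeholder, and the dataset must already exist:
spark.conf.set("viewsEnabled", "true")
# The connector materializes the query result into this dataset before reading it,
# so the dataset must live in the same location as the data being queried.
spark.conf.set("materializationDataset", "spark_scratch")  # placeholder name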
I see, that would work. So if we want to keep using BIGNUMERIC(76), we would have to create a materialization dataset in every location where we want to query and a BIGNUMERIC(76) exists?
Yes - as long as you do not query the BigNumeric itself.
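A hedged sketch of that setup with the google-cloud-bigquery client; the project, dataset name, and location are placeholders:
from google.cloud import bigquery

client = bigquery.Client(project="myproject")

# One scratch dataset per location holding the BIGNUMERIC(76) tables you query.
dataset = bigquery.Dataset("myproject.spark_materialization_eu")
dataset.location = "EU"  # must match the location of the queried tables
dataset.default_table_expiration_ms = 24 * 60 * 60 * 1000  # expire temp tables after a day
client.create_dataset(dataset, exists_ok=True)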
Any reason to use such a wide decimal? I'd be happy to hear about your use case. You can also reach out directly.
Well... at some point someone just set it as such, and here we are trying to use it 🤷‍♂️🤣
Hi, I'm having a problem with Spark and BigQuery. The table that I am trying to query using the connector contains BIGNUMERIC columns (mind that this happens no matter whether I query the specific column that is BIGNUMERIC or not; simply its existence in the schema of the whole table causes the issue). Here is the code:
I know that this issue has been reported earlier, but it is closed as of now, thus I decided to create a new one ( https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/1208 ).