GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Load failure caused by comment at top of query string (IllegalArgumentException: Invalid Table ID) #1182

Closed bonartm closed 7 months ago

bonartm commented 10 months ago

Tested with PySpark 3.3.0 and spark-bigquery-latest_2.12.jar:

spark.conf.set("materializationProject", "<my-project>")
spark.conf.set("materializationDataset", "<my-dataset>")
spark.conf.set("viewsEnabled", True)

query = """
    # just some comment
    SELECT *
    FROM `bigquery-public-data.samples.shakespeare`
    LIMIT 10
"""
spark.read.format("bigquery").load(query)

This produces the following error:

java.lang.IllegalArgumentException: Invalid Table ID '# inline comment SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 10'. Must match '^(((\S+)[:.])?(\w+)\.)?([\S&&[^.:]]+)$$'
        at com.google.cloud.bigquery.connector.common.BigQueryUtil.parseTableId(BigQueryUtil.java:160)
        at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:268)
        at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:204)
        at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.lambda$provideSparkBigQueryConfig$0(SparkBigQueryConnectorModule.java:79)
        at java.base/java.util.Optional.orElseGet(Optional.java:364)
        at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:77)
        at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule$$FastClassByGuice$$1865852.GUICE$TRAMPOLINE(<generated>)
        at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule$$FastClassByGuice$$1865852.apply(<generated>)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:260)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:171)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:169)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
        at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1101)
        ... 21 more
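
The stack trace shows that the string reached parseTableId, so it was treated as a table ID rather than as a query, and the pattern in the message does not allow whitespace. A minimal sketch of that check in Python, assuming that the Java character-class intersection `[\S&&[^.:]]` (which Python's `re` does not support) can be rewritten as `[^\s.:]`. The names `TABLE_ID_PATTERN` and `looks_like_table_id` are made up for this illustration and are not part of the connector:

```python
import re

# Python translation of the Java pattern quoted in the error message.
# Assumption: [\S&&[^.:]] (non-whitespace AND not '.' or ':') is equivalent to [^\s.:].
TABLE_ID_PATTERN = re.compile(r"^(((\S+)[:.])?(\w+)\.)?([^\s.:]+)$")


def looks_like_table_id(value: str) -> bool:
    """Return True if value has the [project.][dataset.]table shape of the pattern."""
    return TABLE_ID_PATTERN.match(value) is not None


query = """
    # just some comment
    SELECT *
    FROM `bigquery-public-data.samples.shakespeare`
    LIMIT 10
"""

# A plain table reference fits the pattern; the commented query does not.
print(looks_like_table_id("bigquery-public-data.samples.shakespeare"))
print(looks_like_table_id(query))
```

This only illustrates why the table ID check rejects the string. It does not reproduce how the connector decides whether the argument to load() is a query in the first place.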

Passing the same query through the query option instead

spark.read.format("bigquery").option("query", query).load()

works without issues. Removing the comment also resolves the error. This matters because some SQL linters (such as sqlfluff) accept options as inline comments at the top of the SQL script, e.g.:

-- sqlfluff:max_line_length:120
SELECT *
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 10
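
For connector versions that still show the error, one possible workaround besides option("query", ...) is to strip leading comments before calling load(). A minimal sketch, assuming only `#`, `--` and `/* ... */` comments need to be handled; `strip_leading_sql_comments` is a hypothetical helper, not part of the connector or of sqlfluff:

```python
import re

# Leading whitespace and comments: '#' or '--' line comments and /* ... */ block comments.
_LEADING_COMMENTS = re.compile(r"\A(?:\s+|#[^\n]*|--[^\n]*|/\*.*?\*/)*", re.DOTALL)


def strip_leading_sql_comments(sql: str) -> str:
    """Drop whitespace and comments that precede the first SQL token."""
    return _LEADING_COMMENTS.sub("", sql, count=1)


linted_query = """-- sqlfluff:max_line_length:120
SELECT *
FROM `bigquery-public-data.samples.shakespeare`
LIMIT 10
"""

cleaned = strip_leading_sql_comments(linted_query)
print(cleaned)

# Intended use (needs a Spark session with the connector on the classpath):
# spark.read.format("bigquery").load(cleaned)
```

Since removing the comment by hand resolved the error, a cleaned string should presumably load the same way. The linter comment stays in the source file, so sqlfluff still sees its option.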

vishalkarve15 commented 7 months ago

Fixed in 0.37.0