GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Issue with Map Type support (pyspark) #1278

Closed: violetautumn closed this issue 3 months ago

violetautumn commented 3 months ago

Hello,

As per the v0.38 release notes for the BigQuery Spark connector, there is support for Map type.

We were doing some functionality testing for a customer, and I tried to write the following map-type DataFrame into BQ using the connector (via PySpark):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age-nested", MapType(IntegerType(), IntegerType(), True))])

data = [('Alice', {1: 1, 2: 2}), ('Bob', {3: 3, 4: 4})]

df = spark.createDataFrame(data, schema)

For reference, following is the schema as printed by df.printSchema():

[screenshot of df.printSchema() output]
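(Reconstructed from the schema definition above, since the original post had a screenshot here, the printed schema is roughly:)

root
 |-- name: string (nullable = true)
 |-- age-nested: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)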

However, when I try to write it to BQ, I get an error that MapType is not supported: java.lang.IllegalArgumentException: MapType is unsupported.

For reference, I am using the latest v0.40 connector with Spark 3.5 from the GCS location gs://spark-lib/bigquery/spark-3.5-bigquery-0.40.0.jar, and the following is the df.write statement that I am using:

df.write \
    .format("com.google.cloud.spark.bigquery") \
    .option("writeMethod", "direct") \
    .mode("append") \
    .save("sample_dataset.map_data3")

I wanted to ask whether map type is only supported in Scala, or whether the support also extends to PySpark (through the underlying Java connector)? And if it does, could you please help resolve the issue and explain how to perform writes for map-type data?

davidrabinowitz commented 3 months ago

Please refer to the restrictions on Map types as listed in the documentation, specifically:

Keys can be Strings only

In the code sample above, the keys are of IntegerType.
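For illustration, here is a minimal sketch (not from the original thread) adapting the reporter's code to that restriction, assuming the same SparkSession spark and the placeholder table sample_dataset.map_data3 from above; with string keys, the same write should go through:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

# Same schema as in the report, but with StringType keys to satisfy the connector's Map restriction
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age-nested", MapType(StringType(), IntegerType(), True))])

# Map keys are now strings instead of integers
data = [('Alice', {"1": 1, "2": 2}), ('Bob', {"3": 3, "4": 4})]
df = spark.createDataFrame(data, schema)

df.write \
    .format("com.google.cloud.spark.bigquery") \
    .option("writeMethod", "direct") \
    .mode("append") \
    .save("sample_dataset.map_data3")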