databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Generated files do not have .xml extension #664

Closed dolfinus closed 10 months ago

dolfinus commented 10 months ago

Hi.

I've created a simple DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType
from datetime import datetime, timezone

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])

df = spark.createDataFrame([{"created-at": datetime.now(tz=timezone.utc)}], schema=schema)
df.show(10, False)

df.write.format("xml").option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSXXX").mode("overwrite").save("2.xml")
+--------------------------+
|created-at                |
+--------------------------+
|2023-10-09 09:05:24.269352|
+--------------------------+

Then I saved it as XML:

df.repartition(1).write \
  .format("xml") \
  .mode("overwrite") \
  .option("compression", None) \
  .option("rowTag", "item") \
  .save("2.xml")

This is the content of the 2.xml folder:

> ls -la 2.xml
drwxr-xr-x  2 maxim maxim   84 Oct  9 09:18 ./
drwxr-xr-x 19 maxim maxim 4096 Oct  9 09:18 ../
-rw-r--r--  1 maxim maxim  156 Oct  9 09:18 part-00000
-rw-r--r--  1 maxim maxim   12 Oct  9 09:18 .part-00000.crc
-rw-r--r--  1 maxim maxim    0 Oct  9 09:18 _SUCCESS
-rw-r--r--  1 maxim maxim    8 Oct  9 09:18 ._SUCCESS.crc

File 2.xml/part-00000 has the following content:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROWS>
    <item>
        <created-at>2023-10-09T09:05:24.269352Z</created-at>
    </item>
</ROWS>

But it does not have an .xml extension. Is that expected behavior?

srowen commented 10 months ago

It's expected. I don't know of a way to control this, and I won't change it at this point (the library is now in Spark anyway).

dolfinus commented 10 months ago

I see, rdd.saveAsTextFile creates a directory of files without extensions. I think this is worth mentioning in the README.
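
For anyone who needs the extension anyway, one possible workaround is to rename the part files after the write finishes. The sketch below is not part of spark-xml: `add_xml_extension` is a hypothetical helper, and it assumes the output directory is on a local filesystem and was written without compression, as in the example above. On HDFS or object storage, the rename would have to go through that filesystem's own API instead.

```python
from pathlib import Path


def add_xml_extension(output_dir):
    """Rename part-* files in output_dir so that they end in .xml.

    Hypothetical helper (not a spark-xml API). Assumes a local filesystem
    and uncompressed output. Returns the new file names.
    """
    renamed = []
    # "part-*" does not match hidden files such as .part-00000.crc
    for part in sorted(Path(output_dir).glob("part-*")):
        if not part.is_file() or part.suffix == ".xml":
            continue  # skip directories and files that were already renamed
        target = part.with_name(part.name + ".xml")
        part.rename(target)
        renamed.append(target.name)
        # Remove the checksum sidecar, which would otherwise be orphaned.
        crc = part.with_name("." + part.name + ".crc")
        if crc.exists():
            crc.unlink()
    return renamed
```

Calling `add_xml_extension("2.xml")` after `df.write...save("2.xml")` would turn `part-00000` into `part-00000.xml` and leave `_SUCCESS` untouched. Running it a second time is a no-op.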