databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

The problem with the case of words for identical names #685

Open hipp0gryph opened 3 months ago

hipp0gryph commented 3 months ago

Hello! If I load files with identical element names that differ only in letter case, I get an error. I would instead expect a NULL string, or two columns with different letter case in the schema. I think that would be logical.

Code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read XML") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0")\
    .getOrCreate()

df = spark.read.format("xml") \
    .option("rowTag", "Root") \
    .option("attributePrefix", "") \
    .option("mode", "PERMISSIVE") \
    .option("charset", "utf-8") \
    .option("inferSchema", False) \
    .option("ignoreNamespace", False) \
    .load(f"case_test/*.xml")
df.printSchema()

xml 1 for folder case_test:

<Root>
    <Element>Block for case switch</Element>
</Root>

xml 2 for folder case_test:

<Root>
    <ElemenT>Block for case switch</ElemenT>
</Root>
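For context, a quick standard-library check (my own illustration, not part of the original report) confirms that XML itself keeps these two element names distinct, since element names in XML are case-sensitive:

```python
# Illustration: the XML data model treats `Element` and `ElemenT`
# as two different element names (tag names are case-sensitive).
import xml.etree.ElementTree as ET

doc1 = "<Root><Element>Block for case switch</Element></Root>"
doc2 = "<Root><ElemenT>Block for case switch</ElemenT></Root>"

tag1 = ET.fromstring(doc1)[0].tag
tag2 = ET.fromstring(doc2)[0].tag
print(tag1, tag2, tag1 == tag2)  # Element ElemenT False
```

The conflict only appears once Spark, which resolves column names case-insensitively by default, tries to merge the two inferred schemas.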

Error:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-2-b867e6c5fcd7> in <module>
    348     .option("inferSchema", False) \
    349     .option("ignoreNamespace", False) \
--> 350     .load(f"case_test/*.xml")
    351 df.printSchema()
    352 init_new_spark_df_methods()

/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Found duplicate column(s) in the data schema: `element`
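A configuration sketch of a possible workaround (an assumption on my part, not verified against spark-xml): Spark's duplicate-column check honors the `spark.sql.caseSensitive` setting, so enabling it may allow both columns to coexist in the inferred schema:

```python
# Possible workaround (untested assumption): with spark.sql.caseSensitive
# enabled, Spark's analyzer resolves `Element` and `ElemenT` as distinct
# columns, which may avoid the "Found duplicate column(s)" error.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read XML case-sensitive") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0") \
    .config("spark.sql.caseSensitive", True) \
    .getOrCreate()

df = spark.read.format("xml") \
    .option("rowTag", "Root") \
    .load("case_test/*.xml")
df.printSchema()  # if the assumption holds, both Element and ElemenT appear
```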

Thank you in advance!

srowen commented 3 months ago

Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is. You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.

hipp0gryph commented 3 months ago

> Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is. You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.

Thank you for the fast answer! From the W3C XML spec (https://www.w3.org/TR/xml/#dt-entref), section 4.3.3, Character Encoding in Entities, says: "XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings)."
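Worth noting that the quoted section is specifically about character encoding names; those are indeed matched case-insensitively, as a quick check against Python's codec registry (my own illustration) also shows:

```python
# Encoding names, the subject of XML spec section 4.3.3, are matched
# case-insensitively; Python's codec registry behaves the same way.
import codecs

print(codecs.lookup("UTF-8").name)  # utf-8
print(codecs.lookup("UTF-8").name == codecs.lookup("utf-8").name)  # True
```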

I think the right way is to read entities whose names differ only in case as the same entity.

hipp0gryph commented 3 months ago

Though I also had doubts about whether they really are the same entities :)