databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Incorrect schema inference if ignoreNamespace is true and a namespace prefix has the same name as a tag #671

Closed hipp0gryph closed 6 months ago

hipp0gryph commented 9 months ago

Hello! I have a problem with schema inference when I use the ignoreNamespace parameter. I use driver versions 0.16.0 and 0.17.0 with PySpark 3.1.1 and 3.3.0. If a namespace prefix and one of the elements have identical names, the inferred schema is incorrect. In my XML, both that element and the namespace prefix are named test. Code:

# `spark` is assumed to be an existing SparkSession with spark-xml on the classpath.
main_tag = "ЭДПФР"  # root element, used as the row tag
charset = "utf-8"

df = spark.read.format("xml") \
    .option("rowTag", main_tag) \
    .option("attributePrefix", "") \
    .option("mode", "PERMISSIVE") \
    .option("charset", charset) \
    .option("inferSchema", False) \
    .option("ignoreNamespace", True) \
    .load("test.xml")
df.printSchema()

Schema with ignoreNamespace set to True: [image]

But if I set ignoreNamespace to False, the schema is inferred correctly: [image]

Thank you in advance!

My XML (sorry that it is in another language):

<ЭДПФР xmlns:АФ="http://test.test/test" xmlns:test="http://test.test/test/test/test"
       xmlns="http://test.test/test/test/test/test/test">
    <test>
        <test:ПроверяемыйДокумент КодФормы="test">
            <test:Файл ИмяФайла="test"/>
        </test:ПроверяемыйДокумент>
        <test:ПроверочныйМодуль ДатаВремяНачала="2023-04-26T09:32:00.093+03:00"
                               ДатаВремяОкончания="2023-04-26T09:32:00.093+03:00" Наименование="test">
            <test:ПроверкаФайлов>
                <test:Файл
                        ИмяФайла="test">
                    <test:Результат>
                        <test:БлокПроверок Название="Проверка файла на соответствие xsd-схеме">
                            <test:Проверка ID="test">
                                <test:ОписаниеПроверки>test
                                </test:ОписаниеПроверки>
                                <test:КодРезультата>50</test:КодРезультата>
                            </test:Проверка>
                        </test:БлокПроверок>
                    </test:Результат>
                </test:Файл>
            </test:ПроверкаФайлов>
        </test:ПроверочныйМодуль>
    </test>
    <СлужебнаяИнформация>
        <АФ:GUID>test</АФ:GUID>
        <АФ:ДатаВремя>2023-04-26T09:32:00.093+03:00</АФ:ДатаВремя>
    </СлужебнаяИнформация>
</ЭДПФР>

srowen commented 9 months ago

I'm not sure that's valid? why not just rename your namespace to not conflict?

hipp0gryph commented 9 months ago

> I'm not sure that's valid? why not just rename your namespace to not conflict?

Thank you for the answer! Yes, it's valid. I checked it in an IDE and with a dedicated XML validator. I couldn't find anything online about a restriction on a tag and a namespace prefix sharing the same name. And it works correctly if I don't use ignoreNamespace.
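The validity claim can be checked independently of Spark. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; `SAMPLE` is a reduced copy of the document from the report and `expanded_names` is a hypothetical helper name. It only shows that a namespace-aware parser accepts the document and resolves the unprefixed `test` element and the `test:` prefix to different namespace URIs; it says nothing about how spark-xml handles them internally.

```python
import xml.etree.ElementTree as ET

# Reduced version of the reported document: the prefix "test" and an
# unprefixed element named "test" appear side by side.
SAMPLE = """\
<ЭДПФР xmlns:АФ="http://test.test/test" xmlns:test="http://test.test/test/test/test"
       xmlns="http://test.test/test/test/test/test/test">
    <test>
        <test:ПроверяемыйДокумент КодФормы="test">
            <test:Файл ИмяФайла="test"/>
        </test:ПроверяемыйДокумент>
    </test>
</ЭДПФР>
"""


def expanded_names(xml_text):
    """Return the "{namespace-uri}local-name" of every element, in document order."""
    root = ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    return [elem.tag for elem in root.iter()]


names = expanded_names(SAMPLE)
for name in names:
    print(name)
```

The unprefixed `test` element lands in the default namespace, while its children land in the namespace bound to the `test` prefix, so the two names are distinct as long as namespaces are retained.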

I have 10 million XML files in this format right now, and I don't know how many files in total have been created on the server with that tag. For now I will use ignoreNamespace: False.

I have many XML files with other schema types, and I think I could hit the same error in the future. Does the driver have other ways to resolve naming problems? Could I perhaps rename namespaces through an option?

Thank you in advance!

srowen commented 9 months ago

Yeah, if you can't change the namespace, then just don't ignore it. It's not ambiguous if you retain the namespace.

hipp0gryph commented 9 months ago

> Yeah, if you can't change the namespace, then just don't ignore it. It's not ambiguous if you retain the namespace

The namespaces in my files may change in the future, for example xmlns:AF2, xmlns:AF4, xmlns:AF5. Yes, that is strange, but the files are sent from a government portal and I cannot change them. So the namespace prefixes on my tags may not be static, for example AF2:Name, AF4:Name. To handle that without ignoreNamespace, I will need logic that extracts and substitutes the namespaces. If this error is fixed in the future and I can use ignoreNamespace, I will be happy! Thank you!