GeethanadhP / xml-avro

Convert XSD -> AVSC and XML -> AVRO
Apache License 2.0
36 stars 26 forks source link

Few complex XML elements are not getting recognised properly #21

Closed jkckiran closed 4 years ago

jkckiran commented 5 years ago

I have few elements of complexType in an XSD like below :

<xs:complexType name="Type">
        <xs:sequence>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="Error" type="tns:InterfaceErrorType"/>
        </xs:sequence>
        <xs:attribute name="name" type="xs:string" use="required"/>
        <xs:attribute name="count" type="xs:integer" use="required"/>
    </xs:complexType>

    <xs:complexType name="ErrorType">
        <xs:simpleContent>
            <xs:extension base="xs:string">
                <xs:attribute name="description" type="xs:string" use="required"/>
            </xs:extension>
        </xs:simpleContent>
    </xs:complexType>

And the corresponding avsc output is provided below for your reference

{
    "name" : "sample1",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "Type",
        "fields" : [ {
          "name" : "Error",
          "type" : [ "null", {
            "type" : "array",
            "items" : {
              "type" : "record",
              "name" : "ErrorType",
              "fields" : [ {
                "name" : "text_value",
                "type" : [ "null", "string" ],
                "source" : "element text_value"
              }, {
                "name" : "_code",
                "type" : "string",
                "source" : "attribute code"
              } ]
            }
          } ],
          "default" : null,
          "source" : "element Error"
        }, {
          "name" : "_name",
          "type" : "string",
          "source" : "attribute name"
        }, {
          "name" : "_count",
          "type" : "string",
          "source" : "attribute count"
        } ]
      }
    } ],
    "default" : null,
    "source" : "element sample1"
  }

Please see this tool is creating a new element _"textvalue" which is not even present in XSD. Any thoughts on how to resolve this ?

GeethanadhP commented 5 years ago

yes it does create text_value, so that you can access the text data under it

assume the below xml <Error _code="404">Page Not found</Error>

Once the data is loaded to Hive, you would access the attribute _code using Error._code, but as _code is inside of ERROR column, ERROR is not a simple column to store information

so you will have to store the "Page not found" (text value) somewhere else, hence i added the text_value to represent any complex data types text value :)

jkckiran commented 5 years ago

Thanks @roycem90 .But how to store the text_value in any other column. As per your example I assume the avsc will have something like "text_value" under Error element. I tried something like Error.text_value to fetch it but it is not getting picked up properly

GeethanadhP commented 5 years ago

you have to give the full path to .text_value from the root, depending on your xml root

jkckiran commented 5 years ago

Hi @roycem90 My XML along with avsc is given below for your reference:

<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
    <con:RESPONSE xmlns:con="http://sample.com/">
    <Applicant AppIndex="1">
        <ApplicantType>01</ApplicantType>
        <Surname>Kiran</Surname>
    </Applicant>
    <Interface errorcount="0" name="Error1"/>
    <Interface errorcount="0" name="Error2">
        <Error code="20">Success - Match</Error>
    </Interface>
    <Interface errorcount="0" name="Error3"/>
    <Interface errorcount="0" name="Error4"/>
    <Interface errorcount="0" name="Error5"/>
    <Interface errorcount="0" name="Error6"/>
    <Interface errorcount="0" name="Error7"/>
</con:RESPONSE>
</soap:Body>
</soap:Envelope>

AVSC :

{
  "type" : "record",
  "name" : "RESPONSE",
  "fields" : [ {
    "name" : "Applicant",
    "type" : [ "null", {
      "type" : "record",
      "name" : "rspApplicantType",
      "fields" : [ {
        "name" : "ApplicantType",
        "type" : [ "null", "string" ],
        "default" : null,
        "source" : "element ApplicantType"
      }, {
        "name" : "ReferenceNumber",
        "type" : [ "null", "string" ],
        "default" : null,
        "source" : "element ReferenceNumber"
      }, {
        "name" : "Surname",
        "type" : [ "null", "string" ],
        "default" : null,
        "source" : "element Surname"
      }, {
        "name" : "Forename",
        "type" : [ "null", "string" ],
        "default" : null,
        "source" : "element Forename"
      }, {
        "name" : "Middlename",
        "type" : [ "null", "string" ],
        "default" : null,
        "source" : "element Middlename"
      }, {
        "name" : "_AppIndex",
        "type" : "string",
        "source" : "attribute AppIndex"
      } ]
    } ],
    "default" : null,
    "source" : "element Applicant"
  }, {
    "name" : "Interface",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "InterfaceType",
        "fields" : [ {
          "name" : "Error",
          "type" : [ "null", {
            "type" : "array",
            "items" : {
              "type" : "record",
              "name" : "InterfaceErrorType",
              "fields" : [ {
                "name" : "text_value",
                "type" : [ "null", "string" ],
                "source" : "element text_value"
              }, {
                "name" : "_code",
                "type" : "string",
                "source" : "attribute code"
              } ]
            }
          } ],
          "default" : null,
          "source" : "element Error"
        }, {
          "name" : "_name",
          "type" : "string",
          "source" : "attribute name"
        }, {
          "name" : "_errorcount",
          "type" : "string",
          "source" : "attribute errorcount"
        } ]
      }
    } ],
    "default" : null,
    "source" : "element Interface"
  } ]
}

My Spark SQL is given below:

SELECT
    CAST(Applicant.ApplicantType AS STRING) AS ApplicantType
    ,CAST(Applicant.Surname AS STRING) AS Surname
    ,Interface._errorcount AS InterfaceErrorCount
    ,Interface._name AS InterfaceErrorName
FROM
    xml_response_extract

But I get the error as

Caused by: java.util.NoSuchElementException: head of empty list

Any idea why this is not working ? If I remove the Interface element completely from the SQL then I was able to load the data into Hive table

GeethanadhP commented 5 years ago

Interface is an array you can't directly use it with . try

SELECT interface[0]._errorcount FROM xml_response_extract
jkckiran commented 5 years ago

@GeethanadhP Thanks but still I am getting the same error. Please note that "Interface" is an empty element except one instance where it has "Error" as sub-elements

GeethanadhP commented 5 years ago

i will try loading the above data and see how it goes

jkckiran commented 5 years ago

Sure. Let me know if you need anything else from my side. Thanks

jkckiran commented 5 years ago

Hi @GeethanadhP did you get a chance to look at this issue ?

GeethanadhP commented 5 years ago

sorry, i totally forgot :(, will check now

GeethanadhP commented 5 years ago
df = spark.read.format('avro').load('output.avro') # The output generated using your sample xml and avsc
df.createTempView('temp_table')
select interface._errorcount from temp_table

this gave me the result of [0,0,0,0,0,0] (Six Zeros) bcoz that is the value for all the _errorcount fields

not sure if thats what you are expecting

I ran the same sql you gave and it ran fine for me, not sure why you are getting an error


SELECT
    CAST(Applicant.ApplicantType AS STRING) AS ApplicantType,
    CAST(Applicant.Surname AS STRING) AS Surname,
    Interface._errorcount AS InterfaceErrorCount,
    Interface._name AS InterfaceErrorName
FROM
    temp_table
jkckiran commented 5 years ago

Can you please let me know the spark version and cloudera version that you are using ? Not sure why it is happening to us. Will dig deep and check

GeethanadhP commented 5 years ago

2.4 spark i have no idea about cloudera version (i was running on databricks.com)

jkckiran commented 5 years ago

ok . Got it