Schema for stringvalue not inferred correctly

ShubhamG25 commented 1 year ago

Hi,

I am trying to generate the schema from a complex XML column in a PySpark DataFrame using the function below.

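from pyspark.sql.types import _parse_datatype_json_string

def ext_schema_of_xml_df(df, options=None):
    # Infer the schema of a single-column DataFrame of XML strings via spark-xml.
    assert len(df.columns) == 1

    # Convert the Python options dict into a Scala Map for the JVM call.
    scala_options = spark._jvm.PythonUtils.toScalaMap(options or {})
    java_xml_module = getattr(getattr(
        spark._jvm.com.databricks.spark.xml, "package$"), "MODULE$")
    java_schema = java_xml_module.schema_of_xml_df(df._jdf, scala_options)
    return _parse_datatype_json_string(java_schema.json())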

When the field value is '8E9N', the inferred data type is StringType, but when the value ends with the letter 'D' or 'F' (for example, '8E9D' or '8E8F'), the inferred data type is DoubleType. Ideally these values should also be inferred as StringType.

Kindly find attached the screenshot and code to reproduce the issue.


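from pyspark.sql.functions import expr

# Case 1: '8E9N' (reported result: contract_num inferred as StringType)
# Create a DataFrame with a single column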
df = spark.createDataFrame([(1,)], ["id"])

# Create an XML column with the desired value
df = df.withColumn("XML_Column", expr(
    'concat_ws("", '
    '    "<Root>", '
    '    "<contract>", '
    '    "<contract_num>8E9N</contract_num>", '
    '    "</contract>", '
    '    "</Root>"'
    ')'
))

payloadSchema = ext_schema_of_xml_df(df.select("XML_Column"))
print(payloadSchema)


# Case 2: '8E9D' (reported result: contract_num inferred as DoubleType)
# Create a DataFrame with a single column
df = spark.createDataFrame([(1,)], ["id"])

# Create an XML column with the desired value
df = df.withColumn("XML_Column", expr(
    'concat_ws("", '
    '    "<Root>", '
    '    "<contract>", '
    '    "<contract_num>8E9D</contract_num>", '
    '    "</contract>", '
    '    "</Root>"'
    ')'
))

payloadSchema = ext_schema_of_xml_df(df.select("XML_Column"))
print(payloadSchema)

# Case 3: '8E8F' (reported result: contract_num inferred as DoubleType)
# Create a DataFrame with a single column
df = spark.createDataFrame([(1,)], ["id"])

# Create an XML column with the desired value
df = df.withColumn("XML_Column", expr(
    'concat_ws("", '
    '    "<Root>", '
    '    "<contract>", '
    '    "<contract_num>8E8F</contract_num>", '
    '    "</contract>", '
    '    "</Root>"'
    ')'
))

payloadSchema = ext_schema_of_xml_df(df.select("XML_Column"))
print(payloadSchema)

srowen commented 1 year ago

Oh, wild. "D" is a suffix meaning "double" in Java, and "F" means "float". "E" is of course a way to specify scientific notation. So "8E9D" parses as 8.0 x 10^9. I wonder what we want to support here. Not the "D" and "F" suffixes, I think. I could imagine supporting something like "9.3E-3" as a double. OK, I can change that.
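To make the difference concrete, here is a minimal pure-Python sketch of the two behaviours discussed above. It is not spark-xml's actual inference code. JAVA_LIKE_DOUBLE is a simplified approximation of the decimal forms that Java's Double.parseDouble accepts, including the optional D/F suffix, and STRICT_DOUBLE is a hypothetical stricter rule that still allows scientific notation but rejects the suffixes. The names JAVA_LIKE_DOUBLE, STRICT_DOUBLE and infer_type are made up for illustration.

```python
import re

# Simplified approximation of the decimal forms accepted by Java's
# Double.parseDouble, including the optional d/D/f/F type suffix.
# Hex floats, NaN and Infinity are deliberately left out.
JAVA_LIKE_DOUBLE = re.compile(
    r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?[dDfF]?"
)

# Hypothetical stricter rule: same number forms, but no type suffix.
STRICT_DOUBLE = re.compile(
    r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?"
)


def infer_type(value, pattern):
    """Classify a value as DoubleType or StringType only.

    A real inference pass would try other types (for example integers)
    first; that is skipped here to keep the sketch focused.
    """
    matched = pattern.fullmatch(value.strip())
    return "DoubleType" if matched else "StringType"


if __name__ == "__main__":
    for sample in ["8E9N", "8E9D", "8E8F", "9.3E-3"]:
        print(sample,
              infer_type(sample, JAVA_LIKE_DOUBLE),
              infer_type(sample, STRICT_DOUBLE))
```

Under the Java-like rule, '8E9D' and '8E8F' are classified as doubles while '8E9N' stays a string, which matches the report. Under the stricter rule all three stay strings and '9.3E-3' is still a double.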

ShubhamG25 commented 1 year ago

Thanks @srowen for the quick fix. I believe the changes will be reflected in the next release.

databricks / spark-xml

Schema for stringvalue not inferred correctly #643