"hidden" _metadata column is not identifying for the XML input file format

databricks / spark-xml

XML data source for Spark SQL and DataFrames

Apache License 2.0

"hidden" _metadata column is not identifying for the XML input file format #648

Closed ChackoSmitha closed 1 year ago

ChackoSmitha commented 1 year ago

We created a DataFrame on a shared cluster and added the "hidden" _metadata column (col("_metadata.file_path") and _metadata.file_name) in order to get the input file name. This returns a result only when the input source is CSV; it does not work for the XML input file format. Because our source files include XML, we cannot use this solution. Please let us know how to handle this. Please see the screenshot below.
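The approach described above can be sketched roughly as follows. This is a minimal PySpark illustration, not the reporter's actual code: the helper names `metadata_column_paths` and `with_source_file`, the file paths, and the `rowTag` value are all hypothetical, and it assumes a runtime where the built-in file sources expose the hidden `_metadata` struct.

```python
# Fields of the hidden _metadata struct mentioned in this issue.
METADATA_FIELDS = ("file_path", "file_name")


def metadata_column_paths(fields=METADATA_FIELDS):
    """Return dotted column paths such as '_metadata.file_path'."""
    return [f"_metadata.{name}" for name in fields]


def with_source_file(df):
    """Append the file metadata fields to a DataFrame (hypothetical helper).

    Expected to work only for sources that expose _metadata,
    such as the built-in CSV reader.
    """
    # Imported lazily so the pure-Python helper above runs without Spark.
    from pyspark.sql.functions import col

    extra = [col(path).alias(path.split(".")[1]) for path in metadata_column_paths()]
    return df.select("*", *extra)


# Usage on a cluster (paths and rowTag are placeholders):
# csv_df = spark.read.option("header", True).csv("/path/to/input.csv")
# with_source_file(csv_df).show()   # file_path and file_name are populated
#
# xml_df = spark.read.format("xml").option("rowTag", "row").load("/path/to/input.xml")
# with_source_file(xml_df)          # per this issue, _metadata is not resolved here
```

The column paths are built in a separate function so that the same selection can be applied to any reader; only the source format changes between the CSV and XML cases.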

srowen commented 1 year ago

Sorry, I don't think this data source supports that. It's a DSv2 functionality, I think. It won't be added to this library.

amar-db commented 1 year ago

@srowen, this is Amar from the L'Oréal account team at Databricks. Regarding this _metadata column, Sandeep Chandran from the Databricks support team suggested that we open an issue on the spark-xml library.

"File metadata is only available on the built-in file source connectors. XML is not natively supported and needs an external connector to read it, and this connector doesn't support file metadata. They can raise a feature request here: https://github.com/databricks/spark-xml/issues"

Could you please let us know why the library won't support this column?

srowen commented 1 year ago

There is no active work on this library, and I suspect this is related to DSv2 support, which is a significant change. (I can barely maintain it with small fixes in my spare time; I am not in eng.) At the least, if it isn't, then I don't know enough to investigate. Support should talk to eng about working on this library and escalate. There is nobody to punt this work to here.