databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Remove New Line coming in between records during spark write dataframe to XML #676

Closed avinashpandu closed 6 months ago

avinashpandu commented 6 months ago

Hi, when writing the following DataFrame to XML, I am seeing blank lines inserted between the rows.

+--------------------+--------------------+--------------+----------+----------+--------------+------+----------+-------------+--------------------+------------------+------------------+
|PORTFOLIO_MANAGER_ID|EFFECTIVE_START_DATE|DATA_SOURCE_ID| LAST_NAME|FIRST_NAME|MIDDLE_INITIAL|SUFFIX|SHORT_NAME|QUALIFICATION|           BIOGRAPHY|SOURCE_SYSTEM_CODE|EFFECTIVE_END_DATE|
+--------------------+--------------------+--------------+----------+----------+--------------+------+----------+-------------+--------------------+------------------+------------------+
|               10247|          2019-09-30|           100| McPherson|   Heather|          null|  null|      null|         null|            &nbsp...|           MA_1066|        9999-12-31|
|                4574|          2018-08-31|           100|     Mordy|     James|            N.|  null|      null|         null|        ~~~~~~<di...|             MA_68|        9999-12-31|
|                4249|          2018-08-31|           100|Lakonishok| Dr. Josef|          null|  null|      null|         null|

■ <stron...|            MA_209|        9999-12-31|
+--------------------+--------------------+--------------+----------+----------+--------------+------+----------+-------------+--------------------+------------------+------------------+

XML:

10247 2019-09-30 100 McPherson Heather &nbsp;&nbsp;&nbsp; <p>Heather McPherson joined as co-portfolio manager of the strategy in 2015. Ms. McPherson has been with the firm since 2002 and served as associate portfolio manager for the US Mid-Cap Value Equity Strategy from 2005 through 2014.</p>&nbsp;</p> MA_1066 9999-12-31

4574 2018-08-31 100 Mordy James N. ~~~~~~~~~~~~~~<div>~~<p>&#9632;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>James~N. Mordy</b>is a Senior Managing Director and portfolio manager at~Wellington Management. �Mr. Mordy received a B.A. from Stanford University and~an M.B.A. from the Wharton School of The University of Pennsylvania.</p>~~</div>~~</body>~~</html>~ MA_68 9999-12-31

4249 2018-08-31 100 Lakonishok Dr. Josef <p>&#9632; <strong>Dr. Josef Lakonishok</strong>, is chief executive officer and founding partner of LSV Asset Management and Portfolio Manager</p><p>for the Fund. He received a B.A. in Economics and Statistics from Tel Aviv University in 1970 and an M.B.A. from Tel Aviv</p><p>University in 1972. Dr. Lakonishok earned a PhD in Business Administration in 1976 from Cornell University.</p> MA_209 9999-12-31

Issue 1

A blank line is created between row tags:

<T_REF_PORTFOLIO_MANAGER>

Issue 2

< is converted to &lt; and & is converted to &amp;, but the XML writer does not convert > to &gt; (please see the BIOGRAPHY tag).

Can you please help me understand this? Thank you!

srowen commented 6 months ago

I don't think either of those is a problem. An extra newline between tags has no meaning, and you don't need to escape close tags, as they are not opened.

Neither of these behaviors comes from this library; both come from the standard javax.xml.stream writer.

This library is not maintained anymore either, being part of Spark now.
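To illustrate the point about newlines between tags, here is a minimal sketch (not code from spark-xml) that reads the same two records with and without blank lines between the row elements and collects the IDs using the JDK's StAX reader. The `ROWS` root tag, the class name `BlankLineDemo`, and the helper names are assumptions made for this example; the row tag and column name are taken from the report above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class BlankLineDemo {
    // Builds a two-row document; `sep` is whatever text sits between the row elements.
    // The ROWS root tag is an assumption for this sketch.
    static String document(String sep) {
        String row1 = "<T_REF_PORTFOLIO_MANAGER><PORTFOLIO_MANAGER_ID>10247</PORTFOLIO_MANAGER_ID></T_REF_PORTFOLIO_MANAGER>";
        String row2 = "<T_REF_PORTFOLIO_MANAGER><PORTFOLIO_MANAGER_ID>4574</PORTFOLIO_MANAGER_ID></T_REF_PORTFOLIO_MANAGER>";
        return "<ROWS>" + sep + row1 + sep + row2 + sep + "</ROWS>";
    }

    // Collects the text of every PORTFOLIO_MANAGER_ID element, in document order.
    static List<String> rowIds(String xml) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("PORTFOLIO_MANAGER_ID")) {
                    ids.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Same records, once with no separator and once with blank lines between rows.
        System.out.println(rowIds(document("")));
        System.out.println(rowIds(document("\n\n")));
    }
}
```

Both calls should yield the same list of IDs, since the whitespace between row elements is not part of any record's content.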

avinashpandu commented 6 months ago

Thank you. Regarding Issue 2, there is an opening < and it was converted to &lt; but the same is not happening for >. Why is that?

srowen commented 6 months ago

I presume because there is no need to escape a close tag if there is no open tag. You can see the code in StaxXmlGenerator that just uses the JDK's XML writer to do all this, so this is behavior of the standard library, I assume. https://github.com/databricks/spark-xml/blob/3b40ef48bce114fa9f1f4d96ae500d74dac0ea91/src/main/scala/com/databricks/spark/xml/parsers/StaxXmlGenerator.scala#L22

What is the issue?
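As a rough way to check this outside Spark, the sketch below writes a string through `XMLStreamWriter.writeCharacters` and parses it back with `XMLStreamReader`. It is not the `StaxXmlGenerator` code; the class name `EscapeDemo`, the helper names, and the sample string are made up for illustration. Whether `>` comes out as `&gt;` may depend on which StAX implementation is on the classpath, so the sketch makes no claim either way and only checks that the text survives a round trip.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class EscapeDemo {
    // Writes `text` as the content of a single BIOGRAPHY element and returns the XML.
    static String write(String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("BIOGRAPHY");
            writer.writeCharacters(text); // the writer applies its own escaping here
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the element back and returns its unescaped text content.
    static String read(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // advance to the BIOGRAPHY start tag
            String text = reader.getElementText();
            reader.close();
            return text;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Gregory<div><p>R&D</p></div>"; // sample value, made up for this sketch
        String xml = write(original);
        System.out.println(xml);                        // inspect how <, & and > were written
        System.out.println(read(xml).equals(original)); // round trip keeps the text intact
    }
}
```

If the round trip returns the original string, a consumer that parses the file sees the same BIOGRAPHY text whether or not `>` was escaped on output.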

avinashpandu commented 6 months ago

I understand your comment Sean. Thank you!

But the string is as follows :

Gregory

Gregory< :div>< ;p>< /p

I need all the > to be escaped during the XML write.

srowen commented 6 months ago

Why? But, again: this would be behavior of the standard JDK writer. I don't know if it's controllable. If it is, you'd have to open a PR against Spark, not this library. But I don't know if there is a reason to do this.