MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Ampersands are escaped even when part of <![CDATA[]]> during static harvest #330

Open ghukill opened 5 years ago

ghukill commented 5 years ago

It has been reported that ampersands in XML records are replaced with &amp; during static harvest, even when they are enclosed in <![CDATA[]]> sections.

This is not ideal when literal ampersands are required in the output. It would be better if the contents of <![CDATA[]]> sections were left untouched.

ghukill commented 5 years ago

Confirmed.

Including this string in an XML record:

<mods:location>
         <mods:url usage="primary"
            ><![CDATA[http://digital.library.wayne.edu/item/wayne:Livingto1876b22354748?goober=tronic&horse=smelt]]></mods:url>
      </mods:location>

is returned as this after harvest:

<mods:location>
         <mods:url usage="primary">http://digital.library.wayne.edu/item/wayne:Livingto1876b22354748?goober=tronic&amp;horse=smelt</mods:url>
      </mods:location>

<![CDATA[]]> is gone, and & has been encoded as &amp;.
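
For reference, the same transformation shows up in any generic parse-then-serialize round trip, because a CDATA section and escaped text are equivalent to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`, not the code path Combine actually runs, and the record and URL are placeholders rather than the one reported above:

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the URL is a placeholder, not the one from the report.
source = '<url usage="primary"><![CDATA[http://example.org/item?a=1&b=2]]></url>'

# Parsing unwraps the CDATA section into plain character data.
elem = ET.fromstring(source)
parsed_text = elem.text

# Serializing writes that character data back out with "&" escaped,
# and the CDATA wrapper is not restored.
round_tripped = ET.tostring(elem, encoding="unicode")

print(parsed_text)
print(round_tripped)
```

If something similar happens between reading the record and writing it out, the parsed value itself still holds a bare `&`. Only the serialized form differs from the input.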

We use this to parse XML for static harvests: https://github.com/databricks/spark-xml#hadoop-inputformat

It appears to use XmlInputFormat from the Apache Mahout project.
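
One way to narrow this down: if XmlInputFormat only slices records out of the raw file by matching start and end tag strings (an assumption based on how the class is described, not verified against the harvest code), then CDATA would survive that step and the escaping would have to come from a later parse or serialize stage. A simplified Python stand-in for that kind of splitter, where `split_records` and the sample document are hypothetical:

```python
def split_records(text, start_tag, end_tag):
    """Yield raw substrings running from start_tag through end_tag.

    No XML parsing takes place; the text is only searched for the two
    delimiter strings, so the record content is returned unchanged.
    """
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return
        end += len(end_tag)
        yield text[start:end]
        pos = end


# Hypothetical two-record document; the first record carries a CDATA section.
sample = (
    "<collection>"
    "<record><url><![CDATA[http://example.org/item?a=1&b=2]]></url></record>"
    "<record><url>http://example.org/other</url></record>"
    "</collection>"
)

records = list(split_records(sample, "<record>", "</record>"))
print(records[0])
```

In this sketch the first record comes back with its `<![CDATA[]]>` section and bare `&` intact. That would point at whatever parses or re-serializes the record afterwards, though that still needs to be confirmed against the actual pipeline.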