IQSS / dataverse

Open source research data repository software
http://dataverse.org

6.1+/EZID: publish dataset fails when metadata contains ampersands #10830

Open anarchivist opened 2 months ago

anarchivist commented 2 months ago

What steps does it take to reproduce the issue?

  1. have an instance of Dataverse 6.1 or higher using EZID for DOI minting
  2. create a new dataset whose metadata contains an unescaped ampersand (i.e. `&` instead of `&amp;`).
  3. attempt to publish the dataset
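
To illustrate why step 2 matters: a bare `&` makes an XML document not well-formed, so a conforming parser rejects it before any schema validation happens. The sketch below is not Dataverse or EZID code; the class name, the helper `isWellFormed`, and the `creatorName` element are assumptions chosen for illustration, and it uses only the JDK's built-in parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative only: class, method and element names are hypothetical.
class AmpersandCheck {

    // Returns true if the string parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. an entity reference that is missing its ';'
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bare ampersand: the parser reads "&P" as the start of an entity reference.
        System.out.println(isWellFormed("<creatorName>S&P Global</creatorName>"));
        // Escaped ampersand: parses normally.
        System.out.println(isWellFormed("<creatorName>S&amp;P Global</creatorName>"));
    }
}
```

The first call should report `false` and the second `true`. The JDK parser may also print its own fatal-error message to standard error for the first input.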

Which version of Dataverse are you using?

6.1

Any related open or closed issues to this bug report?

#3328, #3845, #7611

Are you thinking about creating a pull request for this issue?

not at this point; existing workaround to replace ampersands with "and" will work for us

server.log

```
[2024-09-09T14:16:45.442-0700] [Payara 6.2023.9] [WARNING] [] [edu.harvard.iq.dataverse.DOIEZIdServiceBean] [tid: _ThreadID=109 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1725916605442] [levelValue: 900] [[ modifyMetadata failed]]

[2024-09-09T14:16:45.442-0700] [Payara 6.2023.9] [WARNING] [] [edu.harvard.iq.dataverse.DOIEZIdServiceBean] [tid: _ThreadID=109 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1725916605442] [levelValue: 900] [[ String edu.ucsb.nceas.ezid.EZIDException: bad request - error="ValidationError({'datacite': ['Metadata validation error: XML parse error: EntityRef: expecting \';\', line 6, column 40 (, line 6). metadata="\n\n 10.60503/D3/FFXLIL\n S&P Global\n \n RateWatch Scholar\n \n UC Berkeley Library Dataverse\n 2024\n \n \n \n RateWatch Scholar offers the academic community information for U.S. financial institutions for research and analysis. Data covers over 96,000 branch locations, depending on time period and data type, all provided voluntarily. Data is gathered from institutions of all types and sizes, including banks, credit unions, savings and loan associations, etc. The RateWatch Historical data sets focus on retail products offered to the general public. Deposit rates data: 2001 - 2020 Loan rates data: 2022 Fee data:\n \n Library Data Services Program(UC Berkeley)\n"']})"
Metadata: {'datacite': '\n' '\n' ' 10.60503/D3/FFXLIL\n' ' S&P ' 'Global\n' ' \n' ' RateWatch Scholar\n' ' \n' ' UC Berkeley Library Dataverse\n' ' 2024\n' ' \n' ' \n' ' \n' ' RateWatch ' 'Scholar offers the academic community information for U.S. ' 'financial institutions for research and analysis. Data covers ' 'over 96,000 branch locations, depending on time period and data ' 'type, all provided voluntarily. Data is gathered from ' 'institutions of all types and sizes, including banks, credit ' 'unions, savings and loan associations, etc. The RateWatch ' 'Historical data sets focus on retail products offered to the ' 'general public. Deposit rates data: 2001 - 2020 Loan rates data: ' '2022 Fee data:\n' ' \n' ' Library Data ' 'Services Program(UC ' 'Berkeley)\n' '', 'datacite.resourcetype': 'Dataset'}]]
```

bencomp commented 1 month ago

A cursory look at the code on the current develop branch makes me think there are no unit tests that check the escaping of XML, although there is a test edu.harvard.iq.dataverse.pidproviders.doi.datacite.XmlMetadataTemplateTest that checks simpler values against an XML Schema for DataCite.

The XmlMetadataTemplate uses the standard `XMLStreamWriter`, which automatically escapes text content and attribute values for XML. If I modify the values in the above test to include an ampersand and run it, they are escaped correctly.
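
For reference, a minimal standalone sketch of that escaping behaviour, using only the JDK's `javax.xml.stream` API. This is not the code from XmlMetadataTemplate or its test; the class name, the helper `creatorXml`, and the `creatorName` element are assumptions made for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Illustrative only: class, method and element names are hypothetical.
class XmlEscapeSketch {

    // Serializes one element whose text content is the given (unescaped) name.
    static String creatorXml(String name) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("creatorName");
        xml.writeCharacters(name); // the writer escapes &, < and > in text content
        xml.writeEndElement();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // The raw value contains a bare ampersand; the serialized form should not.
        System.out.println(creatorXml("S&P Global"));
    }
}
```

With the JDK's default implementation, the ampersand comes out as `&amp;`, so a value like the one in this report would be valid XML if it went through this writer.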

This makes me think that, even though the error you see mentions the DataCite schema, the EZID code path doesn't go through the XmlMetadataTemplate. The develop branch no longer includes edu.harvard.iq.dataverse.DOIEZIdServiceBean, and edu.harvard.iq.dataverse.pidproviders.doi.ezid.EZIdDOIProvider doesn't use the template either; it relies on the externally developed EZIDService instead. Apparently, that service doesn't escape strings for XML.

So it appears that the root problem is not in Dataverse (at least for this issue ;)).