datacite / bolognese

Ruby gem and command-line utility for conversion of DOI metadata
MIT License

Attributes on DC XML relatedItem properties appear as hashes in DC JSON #149

Closed codycooperross closed 1 year ago

codycooperross commented 1 year ago

Describe the bug

When bolognese reads DC XML, any relatedItem property that carries an XML attribute appears as a hash in DC JSON instead of a plain string. This causes indexing errors in lupo because Elasticsearch (ES) expects keywords for certain relatedItem properties and receives hashes.

Expected Behaviour

Only the content of a relatedItem property appears as its value in DC JSON.

Steps to Reproduce

Read a DC XML relatedItem property like this:

<volume xml:lang="en">RR-175</volume>

It will appear like this in DC JSON:

"volume": { "lang": "en", "__content__": "RR-175" },

Context (Environment)

This breaks indexing in lupo: DOI metadata updates return 500 errors (e.g. 10.4224/40002814), and certain DOIs do not appear in the index at all (e.g. 10.4224/40002702). See this Sentry error for the former.

Proposal

Hypothesis

Possible Implementation

The code here will likely need to be changed to accommodate the possibility of attributes:

https://github.com/datacite/bolognese/blob/master/lib/bolognese/readers/datacite_reader.rb#L197-L243

At some point, the XSD might be modified to exclude the possibility of attributes for these properties.
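
As a rough illustration of the reader change suggested above, here is a minimal sketch of normalizing a parsed property to its text content. The helper name `content_of` is hypothetical and is not part of bolognese; the sketch only assumes the parsed shape shown in Steps to Reproduce, where an element with attributes becomes a hash whose text sits under `"__content__"`.

```ruby
# Hypothetical helper (not bolognese's actual API): return only the text
# content of a parsed XML element, whether or not it carried attributes.
def content_of(element, content_key: "__content__")
  case element
  when Hash then element[content_key] # element had attributes, e.g. xml:lang
  when String then element            # element had no attributes
  end                                 # anything else (e.g. nil) yields nil
end

# Parsed forms of <volume>RR-175</volume> with and without xml:lang="en".
plain = "RR-175"
with_attrs = { "lang" => "en", "__content__" => "RR-175" }

related_item = {
  "volume" => content_of(with_attrs),
  "issue" => content_of(plain),
  "edition" => content_of(nil)
}
```

With a normalization step like this applied to each affected relatedItem property, both forms of `volume` would serialize as the string "RR-175", which is what the ES keyword mapping expects. Whether the attribute values should be preserved elsewhere is a separate decision this sketch does not address.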
