eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica

metadata.json export causes issue on import due to integer identifier/relation #31

Closed by photomedia 3 years ago

photomedia commented 3 years ago

I just opened an issue at Archivematica about this: https://github.com/archivematica/Issues/issues/1462

The metadata.json file can include fields that look like this:

[ { "dc.relation": [ "https:\/\/someurl.ca\/99", "some other identifier", 824675433 ] } ]

That integer value is not in quotes, which causes a failure on import there. I might try to patch our export so that the identifier is written in double quotes, even though it is an integer.
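To illustrate the point, here is a minimal Python sketch of what a JSON parser sees in the fragment above and of an export-side patch that quotes the number. It is only an illustration: it is not the EPrints exporter or the Archivematica importer, and `quote_scalars` is a hypothetical helper name.

```python
import json

# The metadata.json fragment from this issue (forward slashes escaped as exported).
raw = r'[{"dc.relation": ["https:\/\/someurl.ca\/99", "some other identifier", 824675433]}]'

records = json.loads(raw)
values = records[0]["dc.relation"]

# The third value is parsed as an int, not a str.
types = [type(v).__name__ for v in values]


def quote_scalars(value):
    """Hypothetical helper: recursively turn numbers into strings before export."""
    if isinstance(value, dict):
        return {key: quote_scalars(item) for key, item in value.items()}
    if isinstance(value, list):
        return [quote_scalars(item) for item in value]
    if isinstance(value, bool) or value is None:
        return value  # leave true/false/null untouched
    if isinstance(value, (int, float)):
        return str(value)
    return value


# Re-encoded output now carries the identifier in double quotes.
patched = json.dumps(quote_scalars(records))
```

After the round trip, every entry in `dc.relation` is a string, so a consumer that assumes string values would no longer trip over the identifier.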

photomedia commented 3 years ago

Actually, a small correction to this issue description: the integer comes from a DOI field, which was set to a number in a test record. A real DOI would contain some punctuation, so it wouldn't trigger this issue. Still, there may well be an "identifier" in the metadata at some point that is just a number.

To patch this up on our (EPrints) side, I have modified our DC export so that the id_number is only exported (as "DC relation") when it matches the DOI regex. This means a pure integer should no longer be exported, which avoids exposing this weakness in the Archivematica import scripts. In most repositories (including Concordia), id_number is supposed to hold a DOI anyway. At Concordia we actually have a separate field for other non-DOI identifiers (not exported with DC). That means this change only limits the export of unexpected identifiers.
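A rough Python sketch of that kind of filter follows. The pattern is an assumption for illustration (a "10." prefix, a 4 to 9 digit registrant code, a slash, and a suffix, optionally preceded by a doi.org URL or "doi:"), not the regex used in the plugin, and `relation_values` is a hypothetical name.

```python
import re

# Assumed DOI pattern, NOT taken from the plugin: optional resolver prefix,
# then "10.", a 4-9 digit registrant code, "/", and a non-empty suffix.
DOI_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:)?10\.\d{4,9}/\S+$",
    re.IGNORECASE,
)


def relation_values(id_number):
    """Return the dc.relation values to export for an id_number (hypothetical)."""
    if id_number is None:
        return []
    text = str(id_number).strip()
    # Only DOI-shaped values are exported; bare integers are dropped.
    return [text] if DOI_RE.match(text) else []
```

With this filter, `relation_values(824675433)` returns an empty list, while a DOI-shaped value such as `"10.1000/xyz123"` is passed through unchanged.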

photomedia commented 3 years ago

After switching from the EPrints JSON encoder to the Perl JSON encoder, all values are exported in quotes, even integer values. That resolves this issue on our side, but the Archivematica import code still shouldn't assume every value in a multi-value field is a string.
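For the import side, a defensive approach could look like the Python sketch below. It is an assumption about how a consumer might normalize values, not the actual Archivematica code, and `as_strings` is a hypothetical helper.

```python
def as_strings(values):
    """Coerce a metadata value (scalar or list) to a list of strings (hypothetical)."""
    if values is None:
        return []
    if not isinstance(values, (list, tuple)):
        values = [values]
    return [item if isinstance(item, str) else str(item) for item in values]


relation = ["https://someurl.ca/99", "some other identifier", 824675433]

# A string-only operation such as join fails on the raw list.
try:
    "; ".join(relation)
    joined_ok = True
except TypeError:
    joined_ok = False

# Normalizing first makes the same operation safe.
joined = "; ".join(as_strings(relation))
```

Normalizing at the point where values are read keeps the importer tolerant of exporters that emit numbers without quotes.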