archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: normalization rules are not related to a format's "preservation format?" or "access format?" boolean values. #1185

Open peterVG opened 4 years ago

peterVG commented 4 years ago

Expected behaviour

Format "preservation format?" and "access format?" values are actionable and directly linked to normalization rules. Archivematica will not normalize files that have their format "preservation format?" or "access format?" set to True. Archivematica will provide comprehensive, reliable, and consistent normalization rules for all the formats that it can identify and for which "preservation format?" or "access format?" is set to False.

Alternatively, Archivematica does not provide any default normalization rules at all and removes the "preservation format?" and "access format?" settings. Instead, it would be left to the Archivematica implementer to set normalization rules according to their institutional policies rather than relying on the Archivematica development team to determine their policies by default.

Current behaviour

Format "preservation format?" and "access format?" values have no effect on normalization rules and don't appear to be used at all in Archivematica except for determining the values in the "Already in preservation format" and "Already in access format" columns of the normalization report.

If normalization for preservation is selected and Archivematica encounters a file whose format has "preservation format?" set to True, and a normalization rule exists for that format, Archivematica will still transcode the file, disregarding the "preservation format? = True" setting. Conversely, if the format has "preservation format?" set to False, Archivematica will normalize the file to a preservation format only if a rule has been created by the user or exists in the default Archivematica Format Policy Registry (FPR) settings.

However, normalization rules do not exist for every file format that Archivematica can identify. Many formats that have "preservation format?" or "access format?" set to False have no normalization rule. For these formats, the normalization report indicates that preservation normalization was not attempted, but it does not make clear that the reason is the missing normalization rule.
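The difference between the current and the expected behaviour can be sketched in a few lines of Python. This is only an illustration of the decision logic described in this issue, not Archivematica's actual code: `FormatPolicy`, `Outcome`, `current_behaviour`, and `expected_behaviour` are hypothetical names, and a single boolean stands in for "a normalization rule exists in the FPR".

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NORMALIZE = "normalize"
    SKIP_ALREADY_PRESERVATION = "already in preservation format"
    SKIP_NO_RULE = "no normalization rule for this format"


@dataclass(frozen=True)
class FormatPolicy:
    # Hypothetical stand-in for an FPR format entry; not Archivematica's model.
    name: str
    preservation_format: bool    # the "preservation format?" flag
    has_preservation_rule: bool  # assumed: a preservation rule exists


def current_behaviour(policy: FormatPolicy) -> Outcome:
    # As described in this issue: only the existence of a rule matters,
    # and the "preservation format?" flag is ignored.
    if policy.has_preservation_rule:
        return Outcome.NORMALIZE
    return Outcome.SKIP_NO_RULE


def expected_behaviour(policy: FormatPolicy) -> Outcome:
    # Proposed: the flag is checked first, so a rule never overrides it,
    # and a missing rule is reported as its own distinct reason.
    if policy.preservation_format:
        return Outcome.SKIP_ALREADY_PRESERVATION
    if policy.has_preservation_rule:
        return Outcome.NORMALIZE
    return Outcome.SKIP_NO_RULE
```

Returning a distinct `SKIP_NO_RULE` outcome is one possible way for the normalization report to state why normalization was not attempted.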

Additionally, some file formats within the same format families have inconsistent settings for "preservation format?" or "access format?". This leads to confusing results in normalization reports. For example, in the MS-WORD example below, the files with the Microsoft Word 97 - 2003 format are shown to already be in their preservation format, whereas the files in the successor Microsoft Word 2007+ format are shown to be NOT in a preservation format and are flagged with a red cell. The default FPR rules are not consistent about setting the "preservation format?" and "access format?" to have the same value (either True or False) within format families.

[Image: WordDocsNormalisationInfo]
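The family-level inconsistency described above could be detected with a simple check over the format list. The following is a minimal sketch under stated assumptions: the `(family, format name, flag)` rows, the "Microsoft Word" family label, and the function name `inconsistent_families` are all hypothetical and do not reflect the FPR data model. The two Word flag values mirror what the normalization report in this issue shows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (family, format name, "preservation format?" flag); hypothetical layout.
FormatRow = Tuple[str, str, bool]


def inconsistent_families(rows: Iterable[FormatRow]) -> Dict[str, List[str]]:
    """Return families whose members do not all share the same flag value."""
    flags_by_family: Dict[str, Set[bool]] = defaultdict(set)
    names_by_family: Dict[str, List[str]] = defaultdict(list)
    for family, name, flag in rows:
        flags_by_family[family].add(flag)
        names_by_family[family].append(name)
    # A family is inconsistent when it contains both True and False.
    return {
        family: names_by_family[family]
        for family, flags in flags_by_family.items()
        if len(flags) > 1
    }


# Illustrative data only; the second family is invented for contrast.
rows = [
    ("Microsoft Word", "Microsoft Word 97 - 2003", True),
    ("Microsoft Word", "Microsoft Word 2007+", False),
    ("Hypothetical family", "Format A", True),
    ("Hypothetical family", "Format B", True),
]

print(inconsistent_families(rows))
```

With this sample data, only the Word family is reported, because its two members carry different flag values.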

Considering the MS-Word example specifically, very early releases of Archivematica had a preservation normalization rule to transcode these files to PDF/A format using a headless version of OpenOffice/LibreOffice. However, Archivematica users were experiencing poor and unpredictable results in these (often lossy) conversions. At the same time, Microsoft had switched to using XML formats for native MS-Office files, and newer versions of MS-Office included the ability to migrate older MS-Office files forward. Furthermore, MS-Office files were proving to be ubiquitous worldwide. For all these reasons, native MS-Office formats were determined to serve as more reliable preservation copies than conversions to PDF/A, and the default normalization rules in Archivematica were changed accordingly quite some years ago. However, current Archivematica users who haven't consulted the Format Policy Registry in depth, or who don't have experience with the Ghostscript tool that does PDF to PDF/A conversions, might expect MS-Word files whose format has "preservation format?" set to False to be normalized to PDF/A, which is not the case.

Steps to reproduce

1. Ingest a transfer that has both MS-Word 97 and MS-Word 2007 formats (e.g. archivematica-sampledata -> SampleTransfers -> OfficeDocs).
2. Select normalize for preservation.
3. Check the normalization report.

Your environment (version of Archivematica, operating system, other relevant details)

Archivematica 1.11


sromkey commented 4 years ago

I think there is an in-between option in the expected behaviours for this: Archivematica could ship with default normalization rules, as it currently does, but drop the concept of formats being "preservation" or "access" formats. The rules kind of imply that they are, and dropping the flags would simplify the model a bit.

I'm also not sure about labelling this as a bug because, as far as I know, the expected behaviour as described was never intentionally written in, but @jhsimpson or @evelynPM might have a better recollection.