archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: the default preservation normalization rule for video files results in very large preservation masters #912

Closed evelynPM closed 4 years ago

evelynPM commented 4 years ago

Expected behaviour: Default normalization rules should balance obsolescence concerns with practical considerations of file size and disk space usage.

Current behaviour: Archivematica has a default normalization rule for AV files which generates large FFV1/MKV preservation masters if "normalize for preservation" is selected during processing. In many cases normalizing video files may be unnecessary, as many of the video formats being ingested are ubiquitous and well supported. We should review when and how this normalization rule is applied, possibly removing the default for many video file types.

Note that we may want to review the normalization rule for common raster image formats as well, since they are being normalized to uncompressed TIFF 6.0, which can also be quite large.

Steps to reproduce: Ingest a compressed video file in a standard format such as MP4 and normalize for preservation.


evelynPM commented 4 years ago

We could base decisions on something like NARA's format risk and preservation action plan analysis work: https://github.com/usnationalarchives/digital-preservation.

evelynPM commented 4 years ago

Library of Congress has a recently updated list of preferred and acceptable formats at https://www.loc.gov/preservation/resources/rfs/index.html.

evelynPM commented 4 years ago

I posted a question about this to the Archivematica user forum and other discussion lists. Excerpts from the responses are in the first attached PDF. Also of interest is a recent discussion at the Archivematica User Forum bi-monthly call on November 7; the minutes are attached as the second PDF.

Attachments: Archivematica default normalization rules discussion.pdf, Meeting Minutes - 2019-11-07.pdf

evelynPM commented 4 years ago

After receiving feedback and discussing the issue internally, we have decided on changes to Archivematica’s default preservation normalization rules for video and still image files. Please note that these are rules for preservation only, not access. Also note that these changes will not remove the ability to normalize these types of files for preservation - they just mean that it won’t happen by default when you install Archivematica.

We will remove default preservation normalization for video files entirely, for the following reasons:

1) Wholesale video normalization can require excessive use of Archivematica processing resources and result in very large AIPs, sometimes 10-20 times the size of the original transfer.
2) Highly compressed video files don't look good when they're converted to a lossless preservation format. Too much data was removed during compression, and removing the compression doesn't bring that data back.
3) Standard wrapper formats like AVI and MOV typically already contain preservation-friendly codecs, and normalizing them doesn't accomplish much.
4) For many video formats, there is ubiquitous support for both playback and file conversion. Ironically, the fact that it's easy to convert the files now means that you don't really have to, because that ability to convert will persist and even improve for the foreseeable future. The FFmpeg project, for example, is doing a lot of work on maintaining significant properties during format conversions, so waiting a few years to normalize video files could result in better preservation masters.
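To put reason 1 in concrete terms, here is a rough back-of-the-envelope sketch. It only applies the "10-20 times" range quoted above to a transfer size; the function name and the 50 GB figure are made up for illustration, and real growth depends on the source codec and content.

```python
def estimated_aip_size_gb(transfer_gb, growth_low=10, growth_high=20):
    """Return a (low, high) estimate of AIP size in GB.

    The default 10x-20x growth range is the one quoted in this issue for
    wholesale video normalization; it is a rule of thumb, not a measurement.
    """
    if transfer_gb < 0:
        raise ValueError("transfer size cannot be negative")
    return transfer_gb * growth_low, transfer_gb * growth_high


# Hypothetical example: a 50 GB transfer of compressed video.
low, high = estimated_aip_size_gb(50)
print(f"A 50 GB transfer could yield an AIP of roughly {low}-{high} GB")
```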

We will also remove default preservation normalization for the following still image formats because they are ubiquitous, standardized and can be rendered in any web browser: PNG, JPG and GIF. We considered removing the command for BMP as well, but changed our minds given that http://fileformats.archiveteam.org/wiki/BMP has this to say about BMP: “Though seemingly a simple format, it is complicated by its many different versions, lack of an official specification, lack of any version control process, and ambiguities and contradictions in the documentation.”

Another format for which we will remove default preservation normalization is DNG, which is considered a preservation-friendly format. See http://fileformats.archiveteam.org/wiki/DNG and https://www.loc.gov/preservation/digital/formats/fdd/fdd000188.shtml for more information about the DNG (Adobe Digital Negative) format.

Note that PNG, JPG, GIF and DNG are listed as preferred still image formats by Library of Congress: see http://www.loc.gov/preservation/resources/rfs/stillimg.html.

We will keep TIFF to TIFF as a default normalization rule because TIFFs can contain a variety of compression algorithms, proprietary and non-proprietary; Archivematica's default normalization command results in uncompressed TIFFs. We will also keep JPEG 2000 to TIFF as a default normalization rule because the format has some licensing issues which may prevent it from being fully supported by open-source software. See, for example, https://github.com/archivematica/Issues/issues/91 and https://bugs.launchpad.net/ubuntu/+source/openjpeg2/+bug/711061.
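As a quick summary of the decisions in this comment, the sketch below encodes them as a lookup. This is not Archivematica's FPR data or API: the set names and the `normalized_by_default` function are hypothetical, and formats not discussed in this comment are deliberately left undetermined.

```python
# Hypothetical summary of the decisions described in this comment.
# Still image formats that lose their default preservation rule:
NO_DEFAULT_RULE = {"PNG", "JPG", "GIF", "DNG"}
# Still image formats whose default preservation rule is kept:
DEFAULT_RULE_KEPT = {"BMP", "TIFF", "JPEG 2000"}


def normalized_by_default(format_name, is_video=False):
    """Return True/False per this comment, or None if the format is not covered."""
    if is_video:
        # Default preservation normalization is removed for video entirely.
        return False
    name = format_name.upper()
    if name in NO_DEFAULT_RULE:
        return False
    if name in DEFAULT_RULE_KEPT:
        return True
    return None  # not addressed in this comment


for fmt in ("PNG", "DNG", "BMP", "TIFF", "JPEG 2000"):
    print(fmt, normalized_by_default(fmt))
print("MOV (video)", normalized_by_default("MOV", is_video=True))
```

Any of the removed rules can still be turned back on by an administrator; the lookup only describes what happens on a fresh install.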

ablwr commented 4 years ago

I am working under the assumption that we want to just deactivate and de-associate these rules/commands rather than remove them entirely. That seems to be the safest for situations in which we are upgrading people who use these rules, even if it leaves behind some 'baggage'.

sromkey commented 4 years ago

> I am working under the assumption that we want to just deactivate and de-associate these rules/commands rather than remove them entirely. That seems to be the safest for situations in which we are upgrading people who use these rules, even if it leaves behind some 'baggage'.

Yes, I agree @ablwr .
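For anyone following along, the "deactivate rather than remove" approach can be illustrated with a minimal sketch. The `NormalizationRule` class and `active_rules` helper below are hypothetical stand-ins, not the real FPR models; the point is only that a disabled rule stays in the data and can be switched back on later.

```python
from dataclasses import dataclass


@dataclass
class NormalizationRule:
    """Hypothetical stand-in for a normalization rule; not the real FPR model."""
    source_format: str
    target_format: str
    purpose: str = "preservation"
    enabled: bool = True


def active_rules(rules, purpose="preservation"):
    """Return only the enabled rules for the given purpose."""
    return [r for r in rules if r.enabled and r.purpose == purpose]


rules = [
    NormalizationRule("MOV", "MKV"),    # a video rule, to be off by default
    NormalizationRule("TIFF", "TIFF"),  # a rule that stays on by default
]

# Deactivate the video rule instead of deleting it.
rules[0].enabled = False
print([r.source_format for r in active_rules(rules)])

# The rule is still present, so an upgraded site can simply re-enable it.
rules[0].enabled = True
print([r.source_format for r in active_rules(rules)])
```

The cost is the 'baggage' mentioned above: disabled rows remain in the data even if nobody ever turns them back on.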

sallain commented 4 years ago

First I confirmed that the various normalize for preservation rules for video formats in the FPR had been disabled. I spot-checked rather than looking at every format, but it all seems to be in order.

We have two sample transfers that contain video. I've observed the following:

SampleTransfers/Matroska:

SampleTransfers/Multimedia:

That all seems to be as expected with this change, so that's great!

I then tried to re-enable one of the preservation normalization rules - I re-enabled the preservation rule for Generic MOV files (MOV > MKV). I re-ran SampleTransfers/Multimedia and normalized for preservation. The MOV file was normalized to MKV for preservation as per the rule.

@ablwr is there anything else that you think I should test?

ablwr commented 4 years ago

That sounds great to me! Nice to see this change roll in. 1.11 is gonna be great!