JiscSD / archivematica

Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.
http://www.archivematica.org
GNU Affero General Public License v3.0

Problem: stdout from compress aip event too large when AIP has many files #39

Open jhsimpson opened 6 years ago

jhsimpson commented 6 years ago

The compressAIP client script runs a compression utility and records the tool's standard output and standard error in the database. The code that does this is here: https://github.com/JiscRDSS/archivematica/blob/qa/jisc/src/MCPClient/lib/clientScripts/compressAIP.py#L83-L103

When the AIP being compressed contains thousands of files, the standard output gets very large, and the extra output is not useful. In one example, an AIP with 37,000 original files, the AIP compression PREMIS event recorded by this client script made up over 99% of the total content of the AIP's pointer file. The output is just endless lines starting with 'compressing x . ..'

The pointer file becomes unusable and can cause failures in the Storage Service when the AIP is stored (example: rdss-archivematica#106).

It would be better to ignore the standard output of this tool, not write it to the database at all, and allow the PREMIS event outcome detail note to be empty.
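As a rough illustration of the suggestion above, here is a minimal sketch of running a tool while discarding its standard output and keeping only the exit code and standard error. It is not the actual compressAIP code: the function name `run_discarding_stdout` and the variable `event_outcome_detail_note` are hypothetical, and the sketch assumes Python 3.5+ for `subprocess.run`. A noisy Python child process stands in for the compression utility.

```python
import subprocess
import sys


def run_discarding_stdout(command):
    """Run command, drop its stdout, and return (exit_code, stderr_text).

    Hypothetical helper for illustration; not part of Archivematica.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # per-file progress lines are never buffered
        stderr=subprocess.PIPE,     # keep stderr so failures can still be diagnosed
        universal_newlines=True,    # decode stderr to text
    )
    return completed.returncode, completed.stderr


if __name__ == "__main__":
    # Stand-in for the compression tool: prints one line per "file".
    noisy = [
        sys.executable,
        "-c",
        "for i in range(37000): print('compressing', i)",
    ]
    exit_code, stderr_text = run_discarding_stdout(noisy)
    # The outcome detail note is left empty, as proposed in this issue.
    event_outcome_detail_note = ""
    print(exit_code, repr(stderr_text), repr(event_outcome_detail_note))
```

Sending stdout to `subprocess.DEVNULL` means the output is never held in memory, so its size no longer depends on the number of files in the AIP. Standard error is still captured here on the assumption that it stays small and is useful when the tool fails.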

It is worth pointing out that there is related work going on in the upstream project, documented here: https://github.com/artefactual-labs/archivematica-acceptance-tests/pull/37

That work is intended to be released by the end of 2017. It would be useful to change just this one compressAIP client script here in the JiscRDSS repo, test it with the large datasets available in the Jisc environment, and then consider how best to merge with the ongoing work upstream.