Open ross-spencer opened 4 years ago
Related to #253, potentially?
Hi @sromkey - yeah, possibly? I connected a few ZIP issues like this. My motivation was to suggest that the team consider dropping support for creating AIPs with certain compression types, rather than replacing like-for-like, unless there was a strong desire or rationale for one algorithm over another.
@sromkey for the triage later, I don't feel there is anything specific to fix here. It's more an act of maintenance: we can either change to a multi-threaded option or, my preference, take the opportunity to remove the LZMA choice so there are fewer options to choose from. Small fix, but low-priority.
Changing to the multi-threaded option sounds like it won't be very difficult. Let's tackle that for 1.11 and have a bigger discussion about compression algorithms in general for a later release.
Please describe the problem you'd like to be solved.
From the 7-zip docs:
I extracted each of the compression commands adopted in Archivematica for compressing AIPs and created this bash script.
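For anyone wanting to repeat this kind of comparison without the original script, a minimal sketch of a timing harness follows. It is not the bash script referred to above. The function names are hypothetical, and the `7z` switches used (`a`, `-t7z`, `-m0=`, `-mx=`) are standard 7-Zip options chosen for illustration; they are an assumption and may not match the exact commands Archivematica runs.

```python
from __future__ import annotations

import subprocess
import sys
import time
from pathlib import Path


def build_7z_command(archive: Path, source: Path, method: str, level: int) -> list[str]:
    """Build a 7z 'add' command line for one method/level pair (illustrative only)."""
    return [
        "7z", "a",        # add files to an archive
        "-t7z",           # 7z container format
        f"-m0={method}",  # compression method, e.g. lzma or lzma2
        f"-mx={level}",   # compression level
        str(archive),
        str(source),
    ]


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark(source: Path, workdir: Path, methods, levels):
    """Yield (method, level, seconds, archive_bytes) for each combination."""
    for method in methods:
        for level in levels:
            archive = workdir / f"{method}-{level}.7z"
            # Remove any earlier output: '7z a' updates an existing archive in place.
            archive.unlink(missing_ok=True)
            seconds = time_command(build_7z_command(archive, source, method, level))
            yield method, level, seconds, archive.stat().st_size
```

Calling `benchmark(Path("/path/to/aip"), Path("/tmp"), ["lzma", "lzma2"], [1, 3, 5])` (the paths are placeholders) would produce one row per combination, which can then be pasted into a spreadsheet. It only covers 7z-based methods and requires `7z` on the `PATH`.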
The results are summarized in this spreadsheet.
As you can see, for LZMA compression, even the "quickest" option takes 50 minutes to compress a 29GB package and size reduction is only ~4%. Compare that to 20 minutes for bzip2 with a ~7% reduction.
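To put those two data points on a common footing, the arithmetic can be sketched as below. It uses only the figures quoted above (29GB package, 50 minutes and ~4% for LZMA, 20 minutes and ~7% for bzip2), treats 1 GB as 1000 MB, and the helper names are hypothetical. Since the percentages are approximate, the results are rough.

```python
def throughput_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in MB/s, treating 1 GB as 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


def saved_gb(size_gb: float, reduction_pct: float) -> float:
    """Space saved in GB for a given percentage size reduction."""
    return size_gb * reduction_pct / 100


PACKAGE_GB = 29  # package size quoted in the issue

# Times and approximate reductions quoted in the issue.
runs = {
    "lzma (quickest level)": {"minutes": 50, "reduction_pct": 4},
    "bzip2": {"minutes": 20, "reduction_pct": 7},
}

for name, run in runs.items():
    rate = throughput_mb_per_s(PACKAGE_GB, run["minutes"])
    saved = saved_gb(PACKAGE_GB, run["reduction_pct"])
    print(f"{name}: {rate:.1f} MB/s, about {saved:.2f} GB saved")
```

On these numbers LZMA works through roughly 10 MB/s while saving a little over 1 GB, whereas bzip2 manages roughly 24 MB/s while saving about 2 GB.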
Describe the solution you'd like to see implemented.
Remove lzma as an option. Potentially introduce lzma2 if some form of lzma is desired. Potentially remove one compression option entirely to reduce the number of options that need to be decided between.
Additionally, given the diminishing gains of the different compression levels shown in the graphs in the attached sheet, consider removing compression levels 7 and 9. (More study or reading might be needed to understand the potential benefit of keeping these levels.) We might also consider annotating the UI to describe the impact of certain decisions to users.
Describe alternatives you've considered.
None as such, but it should be noted that the storage service would still need to be able to understand lzma compression going forward.
Additional context
AIP compression is a bit of a bottleneck microservice. These results go some way to showing the performance gains that can be made for anyone using lzma, but also that, to some extent, it will always be fairly slow while AIPs are being compressed at the end of a workflow.
LZMA2
LZMA2 is not a compression method available in Archivematica, but it was easy enough to test alongside the other options to see what the performance may be like in comparison to lzma. Sure enough, there were improvements. Its future inclusion could be seen as a fairly arbitrary decision. Opinions are definitely sought on any or all of the above.