archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: 7-zip LZMA compression in Archivematica can only use 2 threads and is slower than the other compression methods #917

Open ross-spencer opened 4 years ago

ross-spencer commented 4 years ago

Please describe the problem you'd like to be solved.

From the 7-zip docs:

mt=[off | on | {N}]
Sets multithread mode. If you have a multiprocessor or multicore system, you can get a increase with this switch. 7-Zip supports multithread mode only for LZMA / LZMA2 compression and BZip2 compression / decompression. If you specify {N}, for example mt=4, 7-Zip tries to use 4 threads. LZMA compression uses only 2 threads.
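To make the switch concrete, here is a minimal sketch that assembles a `7z a` argument list with an explicit method and thread setting. The function name `build_7z_command` and its defaults are hypothetical, chosen for illustration; this is not the command Archivematica runs, and nothing is executed.

```python
def build_7z_command(archive, source, method="lzma2", level=5, threads="on"):
    """Return an argument list for `7z a`; the command is only built, not run.

    Hypothetical helper for illustration. `threads` may be "on", "off",
    or an integer such as 4 (rendered as -mmt=4).
    """
    if method not in {"lzma", "lzma2"}:
        raise ValueError(f"unsupported method in this sketch: {method}")
    return [
        "7z", "a",            # add files to an archive
        "-t7z",               # 7z container format
        f"-m0={method}",      # compression method
        f"-mx={level}",       # compression level
        f"-mmt={threads}",    # multithread mode, as described in the docs quoted above
        archive,
        source,
    ]


if __name__ == "__main__":
    print(" ".join(build_7z_command("aip.7z", "aip_dir", threads=4)))
```

Per the documentation quoted above, asking for more threads with `lzma` would not help, since LZMA compression uses only 2 threads; the thread setting matters mainly for `lzma2` and bzip2.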

I extracted each of the compression commands adopted in Archivematica for compressing AIPs and created this bash script.

The results are summarized in this spreadsheet.

[Image: summary of the results]

As you can see, for LZMA compression, even the "quickest" option takes 50 minutes to compress a 29 GB package, and the size reduction is only ~4%. Compare that to 20 minutes for bzip2 with a ~7% reduction.
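To put those figures in throughput terms, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 GB = 1000 MB) and uses the approximate numbers quoted above, so the results are indicative only; the function names are illustrative.

```python
def throughput_mb_per_s(size_gb, minutes):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


def space_saved_gb(size_gb, reduction_pct):
    """Space saved in GB for a given percentage size reduction."""
    return size_gb * reduction_pct / 100


if __name__ == "__main__":
    size_gb = 29  # package size from the comment above
    # LZMA, quickest option: ~50 minutes, ~4% reduction
    print(round(throughput_mb_per_s(size_gb, 50), 1))  # 9.7 (MB/s)
    print(round(space_saved_gb(size_gb, 4), 2))        # 1.16 (GB)
    # bzip2: ~20 minutes, ~7% reduction
    print(round(throughput_mb_per_s(size_gb, 20), 1))  # 24.2 (MB/s)
    print(round(space_saved_gb(size_gb, 7), 2))        # 2.03 (GB)
```

On these numbers, bzip2 processes the package roughly 2.5 times faster while saving nearly twice as much space.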

Describe the solution you'd like to see implemented.

Remove lzma as an option. Potentially introduce lzma2 if some form of lzma is desired. Potentially remove one compression option entirely to reduce the number of options that need to be decided between.

Additionally, given the diminishing gains of the higher compression levels shown in the graphs in the attached sheet, consider removing compression levels 7 and 9. (More study or reading might be needed to understand the potential benefit of keeping these levels.) We might also consider annotating the UI to describe the impact of certain decisions to users.

Describe alternatives you've considered.

None as such, but it should be noted that the storage service would still need to be able to understand lzma compression going forward.

Additional context

AIP compression is something of a bottleneck microservice. These results go some way toward showing the performance gains available to anyone currently using lzma, but also that, to some extent, the workflow will always be fairly slow while AIPs are being compressed at the end of it.

LZMA2

LZMA2 is not a compression method available in Archivematica, but it was easy enough to test alongside the other options to see how its performance compares with lzma. Sure enough, there were improvements. Its future inclusion could be seen as a fairly arbitrary decision. Opinions are definitely sought on any or all of the above.
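For anyone wanting to try a similar side-by-side comparison at small scale, the sketch below times Python's standard-library codecs on an in-memory buffer. This is an assumption-laden stand-in, not the bash script used for the results above: it does not call the 7z binary, the standard-library codecs are single-threaded, and the sample data is synthetic, so the numbers will not match the spreadsheet.

```python
import bz2
import lzma
import time


def benchmark(data):
    """Time each codec on `data`; return seconds and compressed/original ratio."""
    codecs = {
        "bzip2": lambda d: bz2.compress(d, compresslevel=9),
        # FORMAT_ALONE is the legacy .lzma container (LZMA1)
        "lzma": lambda d: lzma.compress(d, format=lzma.FORMAT_ALONE),
        # FORMAT_XZ uses the LZMA2 filter by default
        "lzma2": lambda d: lzma.compress(d, format=lzma.FORMAT_XZ),
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "ratio": len(compressed) / len(data),
        }
    return results


if __name__ == "__main__":
    # Synthetic, highly compressible sample; real AIPs will behave differently.
    sample = b"archival information package " * 50_000
    for name, stats in benchmark(sample).items():
        print(f"{name}: {stats['seconds']:.3f}s, ratio {stats['ratio']:.4f}")
```

Swapping the synthetic sample for the bytes of a representative package would give a more meaningful picture, though still without the multithreading that 7z can apply to LZMA2 and bzip2.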



sromkey commented 4 years ago

Related to #253, potentially?

ross-spencer commented 4 years ago

Hi @sromkey - yeah, possibly? I connected a few ZIP issues like this. My motivation was to suggest that the team consider dropping support for creating AIPs with certain compression types, rather than replacing like-for-like, unless there was a strong desire/rationale for a particular algorithm over another.

ross-spencer commented 4 years ago

@sromkey for the triage later, I don't feel there is anything specific to fix here. It's more an act of maintenance: we can either change to a multi-threaded option or, my preference, take the opportunity to remove the LZMA choice so there are fewer options to choose between. Small fix, but low-priority.

sromkey commented 4 years ago

Changing to the multi-threaded option sounds like it won't be very difficult. Let's tackle that for 1.11 and have a bigger discussion about compression algorithms in general for a further release.