
Problem: some jobs would complete faster when executed for the whole package/batch #867

Open sevein opened 5 years ago

sevein commented 5 years ago

Expected behaviour
Archivematica should be able to batch operations to gain better performance.

Current behaviour
In some cases, Archivematica processes certain jobs by dispatching tasks per file. The existing batching mechanism reduces some overhead (1 batch = 1 Gearman job), but we're still running operations like SQL inserts one file at a time, when running them in bulk would be much faster.

An example is assign_file_uuids.py. We're still running the same operations over and over for every file, e.g. parsing the METS file during reingest or writing to the database.
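For illustration, a minimal sketch of the difference in the database case, assuming a Django model along the lines of Archivematica's File table (model and field names here are assumptions, not the actual schema):

```python
# Illustrative only: "File", "uuid" and "currentlocation" are assumed names.
import uuid

from main.models import File  # assumed Django model

file_paths = ["objects/a.xml", "objects/b.xml"]  # stand-in for a batch

# Current pattern: one INSERT per file in the batch.
for path in file_paths:
    File.objects.create(uuid=uuid.uuid4(), currentlocation=path)

# Bulk pattern: a single INSERT for the whole batch.
File.objects.bulk_create(
    [File(uuid=uuid.uuid4(), currentlocation=path) for path in file_paths]
)
```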

Other examples:

Steps to reproduce
The previous section lists jobs that, when visited from the Dashboard, show tasks being dispatched per file. Compared against future implementations, we should be able to see that grouping operations performs better. In our current performance testing environments, we've seen jobs like "Assign file UUIDs to objects" rank very high (10th) in the chart of total time spent.

Your environment (version of Archivematica, OS version, etc.)
qa/1.x



sromkey commented 5 years ago

@sevein is this at all related to https://github.com/artefactual/archivematica/issues/1186 ?

sevein commented 5 years ago

> @sevein is this at all related to artefactual/archivematica#1186?

Definitely related. This issue aims to find ways to optimize the code that processes job batches (e.g. fixing obvious inefficiencies like not writing to the database in bulk), while #1186 is about breaking the batches into smaller groups so they can be processed in parallel. That may also bring performance gains, but it will depend on a number of things: the CPUs available, the kind of tool we're running, etc.

sevein commented 5 years ago

I've found another interesting example in file format identification which we use in a few different places in our workflow.

A tool like Siegfried spends a significant amount of time in its bootstrap process (e.g. loading signatures), which becomes a significant overhead when processing a large number of files. Richard Lehane added a server mode to Siegfried (sf -fpr) that we never employed, mostly because the FPR was designed to run one-off commands (e.g. sf [path]). That's something worth investigating: should we have the server mode deployed in all our Archivematica distros? We already do that for FITS (Nailgun). Alternatively, I've found that running Siegfried only once for the whole batch already brings considerable gains. E.g. the following example compares running Siegfried once to identify 1k files vs running Siegfried once per file:

```
$ time bash -c "for i in many-xmls/*; do sf -json \$i > /dev/null; done;"
69.54user 25.83system 0:57.07elapsed 167%CPU (0avgtext+0avgdata 95952maxresident)k
114912inputs+0outputs (114major+22897244minor)pagefaults 0swaps

$ time bash -c "sf -json many-xmls/ > /dev/null"
10.21user 0.60system 0:09.02elapsed 119%CPU (0avgtext+0avgdata 110584maxresident)k
0inputs+0outputs (0major+39570minor)pagefaults 0swaps
```

FPR commands, however, were designed around individual files, so it's not clear how we would use commands in bulk mode. This is something that CWL has apparently solved already; see the File type in the CommandLineTool class.
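One way to consume bulk output would be to run the tool once over the batch and index the single JSON report per file. A rough sketch, assuming the files/filename/matches layout that sf -json produces:

```python
# Sketch: run Siegfried once over a directory, then map results per file.
# Assumes sf's JSON report has a top-level "files" array whose entries
# carry "filename" and a "matches" list with format ids.
import json
import subprocess

report = json.loads(subprocess.check_output(["sf", "-json", "many-xmls/"]))

formats = {
    entry["filename"]: [m["id"] for m in entry.get("matches", [])]
    for entry in report["files"]
}
# "formats" now maps each path to its identified format ids, ready to be
# stored against the corresponding file rows in the database.
```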

ross-spencer commented 5 years ago

> Alternatively, I've found that running Siegfried only once for the whole batch already brings considerable gains. E.g. the following example compares running Siegfried once to identify 1k files vs running Siegfried once per file.

@sevein what is the trade-off when we have to parse the bulk output of SF and store it against the individual file entries in the database?

cole commented 5 years ago

Quick note that if we're still looking at trade-offs, preserving one command per file would likely be better for future scalability with very large transfers (in terms of number of files), as we would still be able to split jobs up across multiple MCP clients. So I guess +1 for server mode from me.

evelynPM commented 4 years ago

I've noticed this problem for a user performing AIP reingest on AIPs with large numbers of files. In this case, "Assign file UUIDs and checksums" is extremely slow, presumably because Archivematica re-reads a very large AIP METS file for each task.
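A hedged sketch of the kind of fix this suggests: parse the AIP METS once per batch and build an in-memory lookup, so each per-file task becomes a dictionary access instead of a fresh parse (the helper below is hypothetical; only the METS/xlink element names are standard):

```python
# Hypothetical sketch: read the reingest METS once, reuse it for every file.
from lxml import etree

NSMAP = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

def build_uuid_lookup(mets_path):
    """Map file paths (FLocat hrefs) to the UUIDs recorded in the METS."""
    tree = etree.parse(mets_path)
    lookup = {}
    for f in tree.iterfind(".//mets:file", NSMAP):
        flocat = f.find("mets:FLocat", NSMAP)
        if flocat is None:
            continue
        href = flocat.get("{http://www.w3.org/1999/xlink}href")
        # Archivematica file IDs conventionally look like "file-<uuid>".
        lookup[href] = f.get("ID", "").replace("file-", "", 1)
    return lookup

uuids = build_uuid_lookup("metadata/METS.xml")  # parse once per batch
# ...each task then does uuids[path] instead of re-parsing the METS.
```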