Open sevein opened 5 years ago
@sevein is this at all related to https://github.com/artefactual/archivematica/issues/1186 ?
Definitely related. This issue aims to find ways to optimize the code that processes job batches (e.g. fixing obvious inefficiencies like not writing to the database in bulk), while #1186 is about breaking the batches into smaller groups so they can be processed in parallel. That may also bring performance gains, but it will depend on a number of things: the CPUs available, the kind of tool we're running, etc.
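To illustrate the kind of inefficiency meant here, a minimal sketch of per-row inserts vs. a single bulk insert. This uses the stdlib sqlite3 module and a hypothetical files table rather than Archivematica's actual ORM layer, so treat it as an illustration of the pattern only:

```python
import sqlite3

# In-memory database with a made-up table standing in for the
# per-file records Archivematica writes during a batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (uuid TEXT, path TEXT)")

rows = [(f"uuid-{i}", f"objects/file-{i}.xml") for i in range(1000)]

# Slow pattern: one INSERT statement per file.
for row in rows:
    conn.execute("INSERT INTO files VALUES (?, ?)", row)

conn.execute("DELETE FROM files")

# Bulk pattern: hand the whole batch to the driver at once.
conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
print(count)
```

With an ORM like Django's, the equivalent move is collecting model instances and writing them with a single bulk-create call instead of saving each one.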
I've found another interesting example in file format identification, which we use in a few different places in our workflow.
A tool like Siegfried spends a significant amount of time in its bootstrap process (e.g. loading signatures), which becomes a significant overhead when processing a large number of files. Richard Lehane added a server mode to Siegfried (sf -fpr) that we never employed, mostly because the FPR was designed to run one-off commands (e.g. sf [path]). That's something worth investigating: should we have the server mode deployed in all our Archivematica distros? We already do that for FITS (Nailgun). Alternatively, I've found that running Siegfried only once for the whole batch already brings considerable gains. E.g. the following example compares running Siegfried once to identify 1k files vs. running Siegfried once per file:
$ time bash -c "for i in many-xmls/*; do sf -json \$i > /dev/null; done;"
69.54user 25.83system 0:57.07elapsed 167%CPU (0avgtext+0avgdata 95952maxresident)k
114912inputs+0outputs (114major+22897244minor)pagefaults 0swaps
$ time bash -c "sf -json many-xmls/ > /dev/null"
10.21user 0.60system 0:09.02elapsed 119%CPU (0avgtext+0avgdata 110584maxresident)k
0inputs+0outputs (0major+39570minor)pagefaults 0swaps
FPR commands, however, were designed around individual files, so it's not clear how we would use commands in bulk mode. This is something that CWL has apparently solved already; see the File attribute in the CommonWorkflowTool class.
@sevein what is the trade-off when we have to parse the bulk output of SF and store it against the individual file entries in the database?
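One way to picture that trade-off: a single sf -json run over a directory returns one JSON document covering every file, which then has to be split back out per file before it can be stored against the individual file entries. A minimal sketch, assuming Siegfried's JSON output keeps its "files" list with "filename" and "matches" entries (the payload below is a trimmed stand-in; verify the field names against the sf version actually deployed):

```python
import json

# Trimmed stand-in for the JSON that `sf -json many-xmls/` emits;
# real output carries more fields (basis, warning, mime, etc.).
payload = json.loads("""
{
  "files": [
    {"filename": "many-xmls/a.xml",
     "matches": [{"ns": "pronom", "id": "fmt/101"}]},
    {"filename": "many-xmls/b.xml",
     "matches": [{"ns": "pronom", "id": "fmt/101"}]}
  ]
}
""")

# Re-key the bulk output by path so each file's identification can
# be written to its own database record afterwards.
results = {
    entry["filename"]: [m["id"] for m in entry["matches"]]
    for entry in payload["files"]
}
print(results)
```

The parsing cost is linear in the number of files either way; the extra work is mostly the bookkeeping to match bulk output rows back to file UUIDs.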
Quick note that if we're still looking at trade-offs, preserving one command per file would likely be better in terms of future scalability for very large transfers (in terms of number of files), as we would still be able to split up jobs across multiple MCP clients. So I guess +1 for server mode from me.
I've noticed this problem for a user performing AIP reingest on AIPs with large numbers of files. In this case, Assign file UUIDs and checksums is extremely slow, presumably because Archivematica is rereading a very large AIP METS file for each task.
Expected behaviour
Archivematica should be able to batch operations to gain better performance.
Current behaviour
In some cases, Archivematica processes certain jobs by dispatching tasks per file. The existing batching mechanism reduces some overheads (1 batch = 1 Gearman job), but we're still running operations like SQL inserts individually, whereas bulk mode would be much faster.
An example is assign_file_uuids.py. We're still running the same operations over and over for every file, e.g. parsing the METS file during reingest or writing to the database. Other examples:
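For the reingest case, the fix would be along these lines: parse the AIP METS once per batch and build an index, then answer per-file lookups from the index instead of re-reading the document for every task. A rough sketch with stdlib ElementTree and a made-up, heavily simplified METS fragment (the real document is namespaced under http://www.loc.gov/METS/ and far larger):

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for an AIP METS file; real METS has a much
# deeper structure (fileSec, structMap, amdSec, ...).
METS = """
<mets>
  <file id="uuid-1"><FLocat href="objects/a.xml"/></file>
  <file id="uuid-2"><FLocat href="objects/b.xml"/></file>
</mets>
"""

# Parse once per batch...
root = ET.fromstring(METS)

# ...and build a path -> UUID index so each per-file lookup is a
# dictionary access rather than a fresh parse of the whole document.
index = {
    f.find("FLocat").get("href"): f.get("id")
    for f in root.findall("file")
}

print(index["objects/a.xml"])
```

For an AIP with tens of thousands of files, this turns N full parses into one parse plus N constant-time lookups.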
Steps to reproduce
The previous section shows a list of jobs that, when visited from the Dashboard, show how tasks are dispatched per file. When compared with future implementations, we should be able to see that grouped operations perform better. In our current performance testing environments, we've seen jobs like "Assign file UUIDs to objects" rank very high (in 10th position) in the chart of total time spent.
Your environment (version of Archivematica, OS version, etc.)
qa/1.x