archivematica / Issues

Issues repository for the Archivematica project

Problem: normalize.py cannot run in parallel #1161

Open mjaddis opened 4 years ago

mjaddis commented 4 years ago

Please describe the problem you'd like to be solved

I would like normalize.py to run in parallel on machines that have more than one processor. This is because I have large batches of files and want to parallelize the creation of preservation and access derivatives.

It appears that normalization tasks are all run sequentially because normalize.py doesn't implement concurrent_instances() and hence archivematicaClient.py doesn't think it can use fork_runner.py to execute multiple instances of normalize.py at the same time.

Describe the solution you'd like to see implemented

I would like normalize.py to implement the concurrent_instances() method that archivematicaClient.py uses to decide whether a client script can be run in parallel.

import multiprocessing

def concurrent_instances():
    return multiprocessing.cpu_count()
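
For context, my understanding is that the MCPClient only hands a batch to fork_runner.py when the client script module exposes this hook - roughly along these lines (my own sketch, not the actual archivematicaClient.py code):

import importlib


def requested_instances(module_name):
    # Ask a client script module how many instances it wants to run.
    # Scripts without the concurrent_instances() hook are treated as
    # single-instance, so their whole batch runs sequentially.
    module = importlib.import_module(module_name)
    if hasattr(module, "concurrent_instances"):
        return module.concurrent_instances()
    return 1

If this returns more than 1, the batch can be split across that many processes by fork_runner.py; otherwise a single instance of the script processes the whole batch.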

I tried adding the above to normalize.py (Archivematica 1.9.2 server, but I suspect the same will happen on 1.10 and 1.11) along with some debug statements in fork_runner.py to confirm that it was being run in parallel. However, I get errors from some of the normalization tasks: "An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block." I'm guessing this is because multiple instances of normalize.py are trying to execute queries against the database at the same time.

Describe alternatives you've considered

I did think about running multiple MCPClients on the same machine. However, I saw a note in the validate_file.py client script that says that it can't be run in parallel because multiple instances of MediaConch can't be run on the same server. If I have multiple MCPClients, each of which runs validate_file.py, then I guess there is a risk that multiple instances of MediaConch would be run at the same time and fail. Therefore, the ability to run normalize.py in parallel would be better.


sromkey commented 4 years ago

Thanks @mjaddis! I just did a light edit to your title, hope that's OK; we try to state all issues as a problem statement.

jorikvankemenade commented 4 years ago

One thing that might be relevant to mention here is that some of the normalization actions are multithreaded. For example, when converting video files with ffmpeg, multiple cores are already used. IIRC it is possible to run ImageMagick multithreaded too, but I haven't tested whether this is the case for Archivematica.

Overall I think it might be worthwhile to see if the normalization rules can be tweaked to use the system better. I think this should ultimately result in more stable performance for Archivematica (rather than multiple jobs that can each spawn a lot of threads) and probably also better normalization performance, assuming that the designers of the tools know what they are doing :).

Is this something you are able to try @mjaddis?

mjaddis commented 4 years ago

@jorikvankemenade You raise some interesting and important points. As you say, some tools, e.g. ffmpeg, are already able to run multithreaded and use more than one core. Therefore, running the normalize client script in parallel probably wouldn't make sense in the case of lots of video files. It might even make things worse, because I guess you could end up with multiple ffmpeg instances all fighting for the same set of CPUs. But in other cases, the underlying tool might only be single threaded, e.g. I've extended Archivematica's FPR to normalise MS Word docs using LibreOffice and that only uses one process. So when I have 20,000 docs to normalise (my current test case), it would be great to run multiple instances of normalize.py so that I get multiple instances of LibreOffice running at the same time.

Currently, as I understand it, Archivematica's strategy is to batch up my files (default of 128 in each batch) and then run each batch through the MCPClient which then invokes a suitable client script at each stage of the workflow (file format identification, characterisation, normalisation etc.). If these client scripts can be run in parallel, then my batch of 128 files is subdivided into smaller batches depending on the number of CPU cores, e.g. 4 batches of 32 files, and each sub-batch is sent to its own instance of the client script. This all seems sensible to me as a generic strategy and there's loads of tweaking that can be done, e.g. batch sizes, number of instances of each client script to run in parallel etc.
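
As a rough illustration of that subdivision step (my paraphrase of the behaviour, not the actual fork_runner.py code):

import multiprocessing


def split_batch(files, instances=None):
    # Divide one MCPClient batch into sub-batches, one per client script instance.
    instances = instances or multiprocessing.cpu_count()
    return [files[i::instances] for i in range(instances)]


batch = ["file_{:03d}.docx".format(n) for n in range(128)]  # one default-sized batch
print([len(b) for b in split_batch(batch, instances=4)])    # -> [32, 32, 32, 32]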

But, this doesn't take account of what the underlying tools can/can't do, e.g. ffmpeg. It also doesn't take account of whether there are issues in running more than one instance of a tool at the same time on the same server. For example, I think there are issues with multiple concurrent instances of MediaConch on the same machine, which means that the validate client script has been hard coded to run all validation tasks sequentially just in case some of them happen to be MediaConch (see code comments in validate_file.py). If you don't have video files and you don't need/want to validate using MediaConch, then the validation step is slower than it needs to be.

It would be really cool if Archivematica could use its FPR to decide what to do. For example, if tools in the FPR had some extra attributes on them, e.g. 'can use multiple processors' and 'safe to run as multiple instances', then some of the client scripts, e.g. normalize.py, could look these up and decide what to do. normalize.py already looks up the normalisation command to run for each of the files in the batch it is told to process, so I guess it could also get extra info to decide whether to process more than one file at the same time (e.g. no for ffmpeg and yes for LibreOffice). Likewise, validate_file.py could use the FPR to find out whether to run sequentially for MediaConch but in parallel for JHOVE. Basically, rather than the client scripts simply being run in parallel (or not), each client script would decide what is best to do based on the type of files being processed and the tools needed to process them.
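
To make the idea a bit more concrete, here is a purely hypothetical sketch (none of these attributes exist in the FPR today; the names are made up for illustration):

import multiprocessing
from dataclasses import dataclass


@dataclass
class ToolProfile:
    # Hypothetical per-tool attributes that an FPR-like registry could hold.
    name: str
    multithreaded: bool   # the tool already uses multiple cores itself (e.g. ffmpeg)
    parallel_safe: bool   # multiple instances can safely run on one server (not MediaConch)


def instances_for(tool):
    # A client script could use these attributes to decide how many files
    # to process at the same time.
    if tool.multithreaded or not tool.parallel_safe:
        return 1
    return multiprocessing.cpu_count()


print(instances_for(ToolProfile("ffmpeg", multithreaded=True, parallel_safe=True)))        # 1
print(instances_for(ToolProfile("libreoffice", multithreaded=False, parallel_safe=True)))  # cpu count
print(instances_for(ToolProfile("mediaconch", multithreaded=False, parallel_safe=False)))  # 1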

We're working with Artefactual on a project called Preservation Action Registries (PAR) and maybe this could be a way of describing the extra attributes on how to run the various tools in Archivematica. https://github.com/JiscSD/rdss-par/

Anyway, easy for me to suggest - probably harder to implement - and there are probably other reasons why Archivematica does the things it does that I haven't thought of!

jorikvankemenade commented 4 years ago

> But in other cases, the underlying tool might only be single threaded, e.g. I've extended Archivematica's FPR to normalise MS Word docs using LibreOffice and that only uses one process. So when I have 20,000 docs to normalise (my current test case), it would be great to run multiple instances of normalize.py so that I get multiple instances of LibreOffice running at the same time.

If I understand your case correctly, you want multiple parallel invocations of the LibreOffice normalization. A manual equivalent would be to have multiple terminal windows open, each running the same command on a different set of files?

> Currently, as I understand it, Archivematica's strategy is to batch up my files (default of 128 in each batch) and then run each batch through the MCPClient which then invokes a suitable client script at each stage of the workflow (file format identification, characterisation, normalisation etc.).

I think your understanding is correct. This would also mean that in your case of 20,000 files, Archivematica has over 150 batches available (20,000 / 128 ≈ 156 batches at the default batch size). So if you run multiple MCP Clients you will have a lot of potential for concurrency without tweaking Archivematica. You could even see if increasing or decreasing the batch size makes sense for your particular workload.

At the start of this issue you said:

> However, I saw a note in the validate_file.py client script that says that it can't be run in parallel because multiple instances of MediaConch can't be run on the same server.

From my experience and that of others, I can say that this is not a problem. I have been running up to 6 MCP Client instances on the same server, and as long as you don't run out of resources you should be good to go. The MCP Client is "merely" a worker process that runs commands on behalf of Archivematica, so I would think that this will work without any problems.

Please don't get me wrong: I am all in favor of changing some things up to make Archivematica more scalable for bigger workloads. I am just trying to understand your use case and see if there is a possibility of using some of the existing scaling methods. If we can find a way, you can keep working without having to do (or wait on) development. At the same time, it will help to understand the exact problem you are running into and see how Archivematica can be improved.

mjaddis commented 4 years ago

Yes, multiple MCPClients on the same machine is also an option - and I've been trying that too - including reducing the batch size. I did run into some problems when doing this with AM1.9.2, where some normalisation tasks were failing, e.g.:

(1213, 'Deadlock found when trying to get lock; try restarting transaction')
An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.

(1205, 'Lock wait timeout exceeded; try restarting transaction')
An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.

These errors are very similar to the ones right at the beginning of this post, where I had multiple instances of normalize.py running in parallel.

I'd be interested to know if you've ever seen this problem. I'm going to test this with AM1.11 as soon as the deb packages are out and will then log an issue (if there is one).

jorikvankemenade commented 4 years ago

@mjaddis I think you have encountered #752. Unfortunately, this issue pops up sometimes when running a lot of transfers. I also encountered it during my tests using the latest qa branch (so basically 1.11). As you can see it is triaged for 1.12, so hopefully there will be some progress on it. If you have time or ideas, feel free to help out :).

mjaddis commented 4 years ago

@jorikvankemenade I think you are right about #752. The thread for that issue also suggests a possible resolution to my problem by disabling the update of FPR counters, which is something I don't need anyway - so that'll be my next step. So huge thanks for pointing me in that direction. I'll try that next week and update this thread if it solves the problem.

mjaddis commented 4 years ago

I commented out the code that updates the FPR stats: https://github.com/artefactual/archivematica/blob/b4dab1d01ea3978dd6c3919f251bab58e394ec77/src/MCPClient/lib/clientScripts/transcoder.py#L153-L159 and have had no issues since then, either with normalize.py being run in parallel by fork_runner or with multiple MCP Clients on the same machine. I did a test using various transfer sizes (100, 300, 1,000, 3,000 and 10,000 files) with a small batch size (16) and two MCP Clients. Normalisation was successful in all cases.
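
For anyone following along, the code in question increments usage counters on the matched FPR rule after each command runs, roughly along these lines (paraphrased from memory; attribute and field names are approximate, see the linked transcoder.py for the real code):

def record_rule_outcome(fprule, exit_code):
    # Approximate shape of the FPR stats update that I commented out.
    fprule.count_attempts += 1
    if exit_code == 0:
        fprule.count_okay += 1
    else:
        fprule.count_not_okay += 1
    # Saving the shared rule row from many concurrent tasks appears to be
    # what triggers the deadlock / 'atomic' block errors discussed in #752.
    fprule.save()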

jorikvankemenade commented 4 years ago

Sounds like you are making progress on this issue, good! What is your experience of using the parallel runner versus changing the batch size and using multiple MCP Clients? Is there a difference in processing time when running the transfers, and do you see any trends in what the best setup is?

mjaddis commented 4 years ago

@jorikvankemenade Good question about the optimal approach. I haven't got the quantitative data to say yet. My objective is to get 100,000 docs through an Archivematica instance as a single Transfer (which I'm happy to say I have now done using AM1.11 on an AWS EC2 instance).

I started benchmarking on AM1.9.2 and producing graphs like the one below. However, when I went to 10,000 and 30,000 files I hit a problem with METS file generation and update, which took O(n^2) time for n files and started to dominate the overall wallclock time - the nice O(n) behaviour in the graph is lost :-( This isn't a problem with AM1.11, so I switched to that.

However, whilst METS processing in AM1.11 is much faster, it still loads the whole METS into memory - and when there are 100,000 files in the Transfer it gets huge (multiple GBs, even when capturing tool output etc. is turned off). This caused the workflow to fail. After upping the memory to 32GB it went through fine.

In the end, I used 8 cores and 32GB of memory and processed 100,000 Word docs through AM in 31 hrs (normalisation turned off). With normalisation turned on for MS Word to PDF/A (an FPR extension we've applied), it is at least double that. 4-8 MCP Clients seem to keep all cores busy. I haven't got a full set of stats yet to say whether multiple MCP Clients are faster than running normalize.py in parallel. An example of the sort of data I'm collecting is below if you are interested (an AM1.11 Transfer containing 10,000 files, including normalisation for preservation and access in this case).


+----------+--------------------------------------------+-------+-----------------+------------------+
| phase    | type                                       | tasks | wallclock       | tasks_per_second |
+----------+--------------------------------------------+-------+-----------------+------------------+
| SIP      | Normalize                                  | 90013 | 07:50:42.547613 |             3.19 |
| Transfer | Identify file format                       | 10001 | 00:28:34.469632 |             5.83 |
| Transfer | Scan for viruses                           | 10001 | 00:25:18.758152 |             6.59 |
| Transfer | Characterize and extract metadata          | 10004 | 00:22:07.586495 |             7.54 |
| SIP      | Generate AIP METS                          |     1 | 00:05:14.441140 |             0.00 |
| Transfer | Assign file UUIDs and checksums            | 20001 | 00:02:11.568114 |           152.68 |
| Transfer | Generate METS.xml document                 |     1 | 00:01:04.376886 |             0.02 |
| SIP      | Process submission documentation           | 20010 | 00:00:48.651231 |           416.88 |
| Transfer | Validation                                 | 10000 | 00:00:33.864782 |           303.03 |
| Transfer | Extract packages                           |     2 | 00:00:28.585726 |             0.07 |
| SIP      | Prepare AIP                                |     8 | 00:00:24.144336 |             0.33 |
| Transfer | Verify transfer compliance                 | 10005 | 00:00:10.485044 |          1000.50 |
| SIP      | Remove cache files                         | 10000 | 00:00:10.275266 |          1000.00 |
| SIP      | Store AIP                                  |     8 | 00:00:09.572541 |             0.89 |
+----------+--------------------------------------------+-------+-----------------+------------------+