Problem: InnoDB deadlock/timeout on fpr_fprule table during normalization with multiple MCPClients

uofmsean commented 5 years ago

Expected behaviour For CentOS/RHEL 7, starting a second MCPClient can be accomplished by creating a new systemd unit file as described in Deploy Multiple MCPClients. With multiple clients, batches can be processed concurrently.

Current behaviour During the Normalize microservice, the Normalize for access, or Normalize for preservation jobs can encounter an InnoDB transaction deadlock (error 1213) or timeout (error 1205) when the fpr_fprule table is updated. (See InnoDB engine status examples of detected deadlocks below.) This causes the rest of the batch to fail with other tasks in the batch reporting the error "An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block." The transfer will be reported as a failure at the end with the JPEG failure counts being a multiple of the batch size. normalize-for-access-innodb-deadlock-1.txt normalize-for-preservation-innodb-deadlock-1.txt

Steps to reproduce Deploy Archivematica 1.9.1 and Storage Service 0.14.1 on a single CentOS/RHEL 7 host Start two MCPClient processes (see Deploy Multiple MCPClients) MCPServer batch size 50 MariaDB 5.5 or MySQL 5.7.x database on a separate host Perform a Standard Transfer/Ingest with a large transfer source containing > 5000 JPEG files Normalize for preservation and access

Your environment (version of Archivematica, OS version, etc) Red Hat Enterprise Linux Server release 7.6 (Maipo) Archivematica 1.9.1 Storage Service 0.14.1 gearmand 1.1.12 (release 18.el7) Elasticsearch 6.8.0 MariaDB 5.5, or MySQL 5.7.26 Enterprise Two Storage Service Spaces: Local disk / (default) for processing and /local/am-storage on NFS for AIP/DIP storage 4 CPU cores 16GB memory

For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

All PRs related to this issue are properly linked 👍
All PRs related to this issue have been merged 👍
Test plan for this issue has been implemented and passed 👍
Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍

jorikvankemenade commented 5 years ago

I have encountered similar problems with Archivematica 1.10 on CentOS running multiple MCP Clients. I must say that for me the threshold is smaller than 5000 JPEG files. But I am running 6 clients divided over 2 hosts, so maybe that changes the observed behaviour a but.

Are there any ideas of what the root cause for this problem is, or where this can be found? A secondary question would be the impact of this bug. It seems that the processing of the transfer continues, so is it merely my mailbox getting bombarded with failure reports or are there more serious problems?

jorikvankemenade commented 4 years ago

Today I ran into this problem again, a 1205 timeout. And the normalization completely fails. This means that no normalized files will be in the saved AIP.

uofmsean commented 4 years ago

I'm still testing, but early indications are that the problem is due to the fpr_fprule counter updates in transcoder.py. In our case, the pairs of normalization rules for different format versions (e.g., PRONOM JPEG fmt/43 and fmt/44) are both executed in different concurrent MCP client batches. Since the fpr_fprule row counters are being updated in batch, the transactions deadlock on attempts to update both of the same rows. Eventually one of the transactions times out leading to the 1205 error.

uofmsean commented 4 years ago

The relevant source lines in transcoder.py are here. During testing, I commented out the count updates and the fprule.save() and it seems to have resolved the problem. I'm not sure what a proper and permanent fix would look like though.

sevein commented 4 years ago

@uofmsean thanks for the report, it's super useful. Another reason why this might be happening could be that we keep a db transaction open for the whole batch (see the use of transaction.atomic in normalize.py). Far from ideal when you have multiple MCPClient instances trying to write to the database. See a different manifestation here: https://github.com/archivematica/Issues/issues/1111#issuecomment-597881905. In antivirus_clamscan, we worked around that by delaying writes.

mjaddis commented 4 years ago

I see this problem when running two MCPClients on a single server with Archivematica 1.9.2 on Ubuntu 18.04 LTS. If I take the batch size down to 4 rather than 128 then I see this when a transfer has just a few tens of files.

jorikvankemenade commented 4 years ago

Maybe good to link @mjaddis' comment in #1161 here:

I commented out the code that updates the FPR stats: https://github.com/artefactual/archivematica/blob/b4dab1d01ea3978dd6c3919f251bab58e394ec77/src/MCPClient/lib/clientScripts/transcoder.py#L153-L159 and have had no issues since then with either normalise.py being running in parallel by fork_runner or by using multiple MCP Clients on the same machine. I did a test using various transfer sizes (100,300,1000,3000,10000 files) using a small batch size (16) and two MCP Clients. Normalisation was successful in all cases.

archivematica / Issues

Problem: InnoDB deadlock/timeout on fpr_fprule table during normalization with multiple MCPClients #752