archivematica / Issues

Issues repository for the Archivematica project

Problem: move_transfer does not update the database in time #1111

Open jorikvankemenade opened 4 years ago

jorikvankemenade commented 4 years ago

Expected behaviour: When the MCP Client runs move_transfer successfully, the transfer is moved to the correct location and the currentLocation field of the transfer matches the new location.

Current behaviour: I have run into the following situation:

(screenshot)

The transfer fails because of a problem in the "Store file modification dates" job. The logs show the following problem:

feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: archivematicaClient.py: INFO      2020-02-21 14:49:28,662  archivematica.mcp.client:execute_command:208:
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: *** RUNNING TASK: storefilemodificationdates_v0.0
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: archivematicaClient.py: INFO      2020-02-21 14:49:28,685  archivematica.mcp.client:write_task_results_callback:220:
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: *** Completed job: #<storefilemodificationdates_v0.0; exit=1; code=success uuid=f88a74bc-d9d8-41d7-9917-e4086121ca93
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: =============== STDOUT ===============
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: =============== END STDOUT ===============
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: =============== STDERR ===============
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: [Errno 2] No such file or directory: '/cephfs/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/Images_2-3b3261ef-4b40-4e48-91db-4a5eedf7344c/objects/799px-Euroleague-LE_
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: File "/usr/lib/archivematica/MCPClient/job.py", line 111, in JobContext
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: yield
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: File "/usr/lib/archivematica/MCPClient/clientScripts/store_file_modification_dates.py", line 75, in call
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: main(transfer_uuid, shared_directory_path)
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: File "/usr/lib/archivematica/MCPClient/clientScripts/store_file_modification_dates.py", line 62, in main
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: transfer_file.modificationtime = get_modification_date(file_path)
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: File "/usr/lib/archivematica/MCPClient/clientScripts/store_file_modification_dates.py", line 39, in get_modification_date
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: mod_time = os.path.getmtime(file_path)
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/lib64/python2.7/genericpath.py", line 54, in getmtime
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: return os.stat(filename).st_mtime
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: OSError: [Errno 2] No such file or directory: '/cephfs/sharedDirectory/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/Images_2-3b3261ef-4b40-4e48-91db-4a5eedf7344c/objects/799px-Eurol
feb 21 15:49:28 archivematica-client-dev1.cern.ch python[24646]: =============== END STDERR ===============

The transfer cannot be found in workFlowDecisions/selectFormatIDToolTransfer/. This makes sense, because earlier, in the "Move to extract packages" job, the transfer was moved to workFlowDecisions/extractPackagesChoice/. If I look on disk, the files indeed reside in the new location.

When I look into how store_file_modification_dates.py gets the location of the files, I see that it is retrieved via database calls. This means that somehow the file paths of the Files in the transfer are wrong.

It seems that this location is based on a replacement of %transferDirectory% with the current location of the transfer, and this current location is updated in the database during the move transfer job. When I look at the later failedTransferCleanup, it also fails because it is looking for the Transfer in the wrong (old) location.

If I check the current location of the transfer in the database it says %sharedPath%failed/Images_2-3b3261ef-4b40-4e48-91db-4a5eedf7344c, so, unfortunately, I cannot see what the value was at the time of the error. But I am pretty sure it was workFlowDecisions/selectFormatIDToolTransfer/. Is it possible that there is a race condition between the database and the jobs, or do we have a dropped update query here?
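For illustration, here is a minimal sketch of how I understand the placeholder expansion described above, and of how a stale Transfer location in the database turns into the "[Errno 2] No such file or directory" seen in the log. The helper and the values are hypothetical stand-ins, not the actual Archivematica code:

import os


def resolve_file_path(file_location, transfer_location, shared_path):
    # Expand the %sharedPath% / %transferDirectory% placeholders stored in the DB.
    transfer_dir = transfer_location.replace("%sharedPath%", shared_path)
    return file_location.replace("%transferDirectory%", transfer_dir)


shared_path = "/cephfs/sharedDirectory/"
# Stale value: the DB still points at the old watched directory...
transfer_location = "%sharedPath%watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/Images_2-<uuid>/"
file_location = "%transferDirectory%objects/example.jpg"

path = resolve_file_path(file_location, transfer_location, shared_path)
try:
    os.path.getmtime(path)
except OSError as err:
    # ...so stat'ing the resolved path fails exactly like in the job output above.
    print(err)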

Steps to reproduce: Since I think this is a race condition/dropped update, it is hard to reproduce reliably. I think I triggered it by having a setup with multiple MCP Clients and mcp_server.concurrent_packages set to 1. This way the MCP Clients/Gearman are very responsive, potentially racing the database updates.

Your environment (version of Archivematica, operating system, other relevant details): Archivematica qa/1.x, MySQL 5.7. The dashboard, MCP Server, Storage Service, and database all run on separate nodes. Two MCP Clients run on different nodes, with the shared directory on a CephFS share.



jorikvankemenade commented 4 years ago

Another, slightly related, problem is that the updateDB operation is not conditional on the rename actually succeeding. This means that if the rename fails for whatever reason, the location of the Transfer is wrong, because it is updated before the move actually happens.
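To make the ordering concrete, here is a minimal sketch of the hazard. The names and the in-memory "DB" are illustrative only, not the actual move_transfer implementation:

import os

# Hypothetical stand-in for the Transfers table: transfer UUID -> currentLocation.
DB = {}


def set_current_location(transfer_uuid, location):
    DB[transfer_uuid] = location


def move_transfer_unconditional(transfer_uuid, src, dst):
    # Ordering described above: the DB is updated first, unconditionally...
    set_current_location(transfer_uuid, dst)
    os.rename(src, dst)  # ...so if this raises, the DB already points at dst.


def move_transfer_conditional(transfer_uuid, src, dst):
    # Only record the new location once the move has actually succeeded.
    os.rename(src, dst)
    set_current_location(transfer_uuid, dst)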

I am not sure if this is worth a separate issue, let me know and I'll create one.

jorikvankemenade commented 4 years ago

Today I had another example of this:

(screenshot)

This is first a move from currentlyProcessing to workFlowDecisions/createTree, and then one from currentlyProcessing to currentlyProcessing, which of course fails because of the earlier situation. This is again because a database update is dropped, missed, or raced; I am not sure which.

In a normal situation it looks like this:

(screenshot)

This is first a move from currentlyProcessing to workFlowDecisions/createTree and back.

And here is a last example: a file that has been moved to examine contents, but is believed to still be in workFlowDecisions/extractPackagesChoice:

(screenshot)

Another thing that I noticed is that when I increase the number of MCP Clients, the number of failures of this type also increases. To me, this is an indication of a race condition between the MCP Clients. Probably something like a read-before-update race. Is anyone familiar with the update internals of Django?
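For reference, this is the kind of read-before-update race I mean, sketched against a hypothetical Django model named Transfer with uuid and currentlocation fields (illustrative stand-ins, not the actual Archivematica models):

from django.db import transaction

# Racy read-modify-write: two workers can both read the same row, and the last
# save() wins, silently writing every field of its stale in-memory instance
# back to the database.
transfer = Transfer.objects.get(uuid=transfer_uuid)
transfer.currentlocation = new_location
transfer.save()

# A single UPDATE statement avoids writing back stale copies of other fields:
Transfer.objects.filter(uuid=transfer_uuid).update(currentlocation=new_location)

# For read-then-decide logic, row locking serializes concurrent workers:
with transaction.atomic():
    transfer = Transfer.objects.select_for_update().get(uuid=transfer_uuid)
    transfer.currentlocation = new_location
    transfer.save()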

arthytrip commented 4 years ago

I am of a similar opinion to Jorik and would like to illustrate some extreme cases. I apologize in advance for the length of this post, but the situation is delicate because, as Jorik says, the problem increases as the system scales and produces unexpected results.

For the tests that I'll describe below, 20 packages were used, each with a medium-sized PDF document of approximately 8 MB. A single standard transfer "LOAD20-00X" was launched containing the sequence of the 20 packages (add, add, add... and then Start), which then created 20 Transfers, named for example "LOAD20-001", ..., "LOAD20-001_19". All tests were executed with the same Processing Configuration and in the same IT infrastructure.

Case 1: Simple installation with only one host --> all OK. In an instance of Archivematica 1.10, hosted in a single CentOS 7 VM with 2 CPUs, all the packages were processed correctly.

Case 2: Scaled installation with 1 dedicated MCPClient --> some trouble. The same packages were then used in a scaled landscape consisting of 5 hosts, including 1 dedicated MCPClient host with 8 CPUs and 2 active MCPClient instances.

19 processes were successful. 1 process presented an anomaly: despite having reached the end (AIP and DIP in the Storage Service), the Transfer appeared duplicated and stopped, waiting at the "Assign UUIDs to directories?" job, which to me is inexplicable!

(screenshots)

There was a residual LOAD20-002_9-b54f5811-733a-48fa-8cd9-9ebd77503dc6processingMCP.xml file in the sharedDirectory/currentlyProcessing.

Case 3: Scaled installation with 2 dedicated and more powerful MCPClient hosts --> the worst! Finally, the same packages were used in a scaled landscape, now made up of 6 hosts, 2 of which are MCPClient hosts, each with 8 CPUs and 8 active MCPClient instances.

These were the results.

4 Transfers ("LOAD20-001_2", "..._5", "..._12" and "..._17") failed at the "Create SIP from transfer objects" job with the following error, the same for all (note the green dot at the top left, which would indicate successful execution of a Transfer that in fact failed). (screenshots)

The Ingest "LOAD20-001_18" failed at the "Verify AIP" job with the following error (note also here the green dot that makes the Ingest seem correctly executed). (screenshots)

In this case, in the MCPClient-1 log, I found the following lines (this is the only error present in the MCPClient log files).

INFO      2020-03-10 15:26:58  archivematica.mcp.client:archivematicaClient:execute_command:206:  

*** RUNNING TASK: verifyaip_v1.0
WARNING   2020-03-10 15:26:58  archivematica.common:databaseFunctions:getAMAgentsForFile:138:  File with UUID 8c783251-001d-4e92-8026-a344b2e9cb87 does not exist in database; unable to fetch Agents
INFO      2020-03-10 15:26:58  archivematica.mcp.client:archivematicaClient:write_task_results_callback:215:  

*** Completed job: #<verifyaip_v1.0; exit=0; code=success uuid=e2e6eb72-26c3-4a30-8b90-a1fb2838730b
=============== STDOUT ===============
Failed to write PREMIS event to database. Error: (1452, 'Cannot add or update a child row: a foreign key constraint fails (`MCP`.`Events`, CONSTRAINT `Events_fileUUID_8410512400a63bf_fk_Files_fileUUID` FOREIGN KEY (`fileUUID`) REFERENCES `Files` (`fileUUID`))')
All checksums (count=6) generated at start of transfer match those generated by BagIt (bag).

=============== END STDOUT ===============
=============== STDERR ===============

=============== END STDERR ===============

The Ingest "LOAD20-001_16" failed in a "Move to processing directory" job (also a green dot here). (screenshots)

I found no trace of these errors in the MCPServer log, the MCPClient log, or the Storage Service log, and nothing even in the MariaDB log.

In sharedDirectory/currentlyProcessing there are directories referring to the Transfers that went into error (note the different permissions, which I cannot explain).

# ls /var/archivematica/sharedDirectory/currentlyProcessing -l
total 24
drwxr-xr-x 5 archivematica archivematica 4096 Mar 10 16:51 LOAD20-001_12
drwxrwxr-x 5 archivematica archivematica 4096 Mar 10 16:49 LOAD20-001_14-ffd60967-232d-4d45-bea4-dea51a5beb11
drwxr-xr-x 5 archivematica archivematica 4096 Mar 10 16:51 LOAD20-001_17
drwxr-xr-x 5 archivematica archivematica 4096 Mar 10 16:51 LOAD20-001_2
drwxr-xr-x 5 archivematica archivematica 4096 Mar 10 16:51 LOAD20-001_5
drwxrwxr-x 5 archivematica archivematica 4096 Mar 10 16:49 LOAD20-001-958698b4-1476-4287-a58c-33947770ffd0

With the following tree.

# tree /var/archivematica/sharedDirectory/currentlyProcessing
/var/archivematica/sharedDirectory/currentlyProcessing
├── LOAD20-001_12
│   ├── logs
│   │   └── fileMeta
│   ├── metadata
│   │   └── submissionDocumentation
│   └── objects
├── LOAD20-001_14-ffd60967-232d-4d45-bea4-dea51a5beb11
│   ├── logs
│   │   ├── bulk-ab00a34c-f74e-4e9a-85d4-842a65e70968
│   │   │   ├── domain_histogram.txt
│   │   │   ├── domain.txt
│   │   │   ├── exif.txt
│   │   │   ├── report.xml
│   │   │   ├── telephone_histogram.txt
│   │   │   ├── telephone.txt
│   │   │   ├── url_histogram.txt
│   │   │   ├── url_services.txt
│   │   │   └── url.txt
│   │   ├── fileFormatIdentification.log
│   │   ├── fileMeta
│   │   └── filenameCleanup.log
│   ├── metadata
│   │   ├── directory_tree.txt
│   │   ├── metadata.csv
│   │   └── submissionDocumentation
│   │       ├── ef_b_029-382480.xml
│   │       └── METS.xml
│   ├── objects
│   └── processingMCP.xml
├── LOAD20-001_17
│   ├── logs
│   │   └── fileMeta
│   ├── metadata
│   │   └── submissionDocumentation
│   └── objects
├── LOAD20-001_2
│   ├── logs
│   │   └── fileMeta
│   ├── metadata
│   │   └── submissionDocumentation
│   └── objects
├── LOAD20-001_5
│   ├── logs
│   │   └── fileMeta
│   ├── metadata
│   │   └── submissionDocumentation
│   └── objects
└── LOAD20-001-958698b4-1476-4287-a58c-33947770ffd0
    ├── logs
    │   ├── bulk-2efa6130-94d9-4d4c-9090-00c6b1130091
    │   │   ├── domain_histogram.txt
    │   │   ├── domain.txt
    │   │   ├── exif.txt
    │   │   ├── report.xml
    │   │   ├── url_histogram.txt
    │   │   ├── url_services.txt
    │   │   └── url.txt
    │   ├── fileFormatIdentification.log
    │   ├── fileMeta
    │   └── filenameCleanup.log
    ├── metadata
    │   ├── directory_tree.txt
    │   ├── metadata.csv
    │   └── submissionDocumentation
    │       ├── ef_b_011-382462.xml
    │       └── METS.xml
    ├── objects
    └── processingMCP.xml

38 directories, 30 files

At this point, I restarted the MCPServer and... almost all the failed Transfers/Ingests restarted! (screenshot)

The Transfers "LOAD20-001_2", "..._5", "..._12" and "..._17" all restarted and completed correctly, as seen for example below. (screenshot)

The Ingest "LOAD20-001_18", which had previously failed in the "AIP verification" process, was restarted and ended positively. (screenshot)

The Ingest "LOAD20-001_16", which had previously failed in a "Move to processing directory" job, also restarted and, without respecting the original processing configuration, asked for the choices to be made manually, as if the processing configuration contained "None" in all the items that allow it (Bind PIDs, Compression level, Store AIP, Store AIP location, and so on...). (screenshot)

All this until the "Verify AIP" job failed with the following error. (screenshots)

At this point, these two directories remained in sharedDirectory/currentlyProcessing (with the special permissions noted above).

# ls /var/archivematica/sharedDirectory/currentlyProcessing -l
total 8
drwxrwxr-x 5 archivematica archivematica 4096 Mar 10 16:49 LOAD20-001_14-ffd60967-232d-4d45-bea4-dea51a5beb11
drwxrwxr-x 5 archivematica archivematica 4096 Mar 10 16:49 LOAD20-001-958698b4-1476-4287-a58c-33947770ffd0

Yet both of the corresponding Transfers/Ingests completed correctly; the processed packages are present in the AIP location and can be downloaded from the Archival Storage interface of the Dashboard.

All these anomalies are not present in simple installations, but they affect scaled systems and make them absolutely unreliable!

Does anyone have an idea whether there are any parameters in Django, Gearman, or anything else that could resolve the situation?

jorikvankemenade commented 4 years ago

4 Transfers ("LOAD20-001_2", "..._5", "..._12" and "..._17") failed at the "Create SIP from transfer objects" job with the following error, the same for all (note the green dot at the top left, which would indicate successful execution of a Transfer that in fact failed).

This error is the same as the one we encountered in #1124. That problem might be related to the issue we are discussing here, but we are not sure yet. It is good to know that the problem also appears on NFS-hosted shared directories, not only on CephFS-hosted ones.

The Ingest "LOAD20-001_16" failed in a "Move to processing directory" job (also a green dot here).

This could be an example of a filesystem race condition, I think. The move transfer job tries to move the files from the approveNormalization directory back to the currentlyProcessing directory, i.e. the opposite of the earlier move into the approveNormalization directory. However, the job first sees the directory, and then it is gone by the time the move is attempted, which indicates that the NFS updates interleave with this job.
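As a sketch of a more defensive pattern (not Archivematica code): attempt the move directly instead of checking for the directory first, and retry briefly if the shared filesystem has not caught up yet. Whether retrying is actually appropriate depends on why the path is missing, so this only illustrates the check-then-act problem:

import errno
import os
import time


def move_with_retry(src, dst, attempts=5, delay=0.5):
    # Attempt the rename rather than stat-ing src first (check-then-act races
    # on a shared filesystem); retry a few times if the path is not yet visible.
    for attempt in range(attempts):
        try:
            os.rename(src, dst)
            return
        except OSError as err:
            if err.errno != errno.ENOENT or attempt == attempts - 1:
                raise
            time.sleep(delay)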

I found no trace of these errors in the MCPServer log, the MCPClient log, or the Storage Service log, and nothing even in the MariaDB log.

In my install, the MCP Clients log to the system log, not to a log file. Have you tried checking the system logs, using journalctl for example?

At this point, I restarted the MCPServer and... almost all the failed Transfers/Ingests restarted!

There can be multiple reasons for that. First of all, how many of the "Failed transfer" jobs completed successfully? If a transfer is still in one of the watched directories because the clean-up failed, the transfer will be restarted. There can also be other reasons why Archivematica tries to push the failed transfers through the system again. I don't think this is a problem per se, depending on the end result. I do think that the successful restarts show that the transfers themselves are correct; it was merely the system that was somehow in an inconsistent state, which is in line with this and several other related issues.

sevein commented 4 years ago

@jorikvankemenade, could you give https://github.com/artefactual/archivematica/pull/1587 a try? It's merged already. When I started looking into #1141 I didn't realise that it was just another manifestation of the same problem. You're a few steps ahead of me! But hopefully my fix is a good compromise for AM v1.11?

Another thing that I noticed is that when I increase the number of MCP Clients, the number of failures of this type also increases. To me, this is an indication of a race condition between the MCP Clients.

You're right. @scollazo saw similar issues and he's got multiple MCPClients deployed too. We'll have to look into that update query. I've started to think that it is due to a lost update related to database locking...

All these anomalies are not present in simple installations, but they affect scaled systems and make them absolutely unreliable!

Thanks for the detailed report! I haven't looked into #1124 yet, @arthytrip. We don't have much experience at scale with shared filesystems. Often I've seen solutions that scale by deploying multiple pipelines as opposed to clustering MCPClient, so this is really good feedback!

arthytrip commented 4 years ago

@jorikvankemenade

I found no trace of these errors in the MCPServer log, the MCPClient log, or the Storage Service log, and nothing even in the MariaDB log.

In my install, the MCP Clients log to the system log, not to a log file. Have you tried checking the system logs, using journalctl for example?

You're right; analyzing journalctl, I found the corresponding errors. For the transfer LOAD20-001_18:

Tue 10 16:54:08 dkl-pocam-mcpcli10.dedacenter.services python[28952]: mv: cannot stat '/var/archivematica/sharedDirectory/watchedDirectories/storeAIP/LOAD20-001_18-57a565fd-da26-4996-9501-1a39d1d58a59': Device or resource busy

And for the transfer LOAD20-001_16:

Tue 10 16:53:29 dkl-pocam-mcpcli10.dedacenter.services python[29024]: OSError: [Errno 2] No such file or directory: '/var/archivematica/sharedDirectory/currentlyProcessing/./LOAD20-001_16-314e5627-b11a-4c3b-b75b-65a542513858'

At this point, I restarted the MCPServer and... almost all the failed Transfers/Ingests restarted!

There can be multiple reasons for that. First of all, how many of the "Failed transfer" jobs completed successfully? If a transfer is still in one of the watched directories because the clean-up failed, the transfer will be restarted. There can also be other reasons why Archivematica tries to push the failed transfers through the system again. I don't think this is a problem per se, depending on the end result. I do think that the successful restarts show that the transfers themselves are correct; it was merely the system that was somehow in an inconsistent state, which is in line with this and several other related issues.

Yes, indeed, I agree. Blinded by other problems, I did not reflect on the fact that the ability to resume a process after a "catastrophic" event is a plus. Now, however, I would be curious to know what would happen if, before restarting the system, I deleted the contents of the watchedDirectories; would I get a misalignment with the database? But I don't want to go off topic.

sevein commented 4 years ago

Something that we could try is to commit the transaction that move_transfer begins sooner rather than later. We use transaction.atomic as a context manager, but we don't exit the context until the very end of the call function:

def call(jobs):
    # Pseudocode: the transaction opened here is only committed when the
    # transaction.atomic() block exits, after every job has run.
    with transaction.atomic():
        for job in jobs:
            with job.JobContext():
                do_work()

So it's possible that MCPServer is simply looking at the old value because the transaction started by move_transfer has not been committed yet by the time MCPServer looks it up:

MCPClient (move_transfer)           [DB]    MCPServer (get_or_create_from...)
______________________________________________________________________________
                                    |  |
- SQL: ---------------------------» |  |  
    BEGIN;                          |  |  
    UPDATE Transfer                 |  |  
    SET currentlocation = 'B/PKG'   |  |  
    WHERE transferUUID = '12345';   |  |  
- FS:                               |  |  
    mv /shared/A/PKG /shared/B      |  |  
- EXIT job.JobContext()             |  |  
                                    |  |   - WATCHED_DIR PKG-12345 FOUND!
                                    |  | «-- SQL:
                                    |  |       BEGIN;
                                    |  |       SELECT *
                                    |  |       FROM Transfer
                                    |  |       WHERE transferUUID = '12345';
                                    |  | «-- SQL:
                                    |  |       COMMIT;
                                    |  |       SEEN 'A/PKG'
                                    |  |       
- SQL: ---------------------------» |  |       
    COMMIT;                         |  |  
- END                               |  |  

https://github.com/artefactual/archivematica/pull/1587 mitigates this issue by having MCPServer ensure that the right location is set before the processing of the chain starts.
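For what it's worth, here is a hedged sketch of the "commit sooner" idea, assuming a Django model named Transfer with a currentlocation field as a stand-in (this is not what the PR or move_transfer actually do): keep transaction.atomic() around just the database update, so the new location is committed before the package becomes visible in the watched directory.

import os

from django.db import transaction


def move_transfer_commit_first_sketch(transfer_uuid, src, dst):
    # Commit the location change first...
    with transaction.atomic():
        Transfer.objects.filter(uuid=transfer_uuid).update(currentlocation=dst)
    # ...so by the time the move makes the package visible to MCPServer's
    # watched-directory handling, the row already holds the new value.
    os.rename(src, dst)

The trade-off is a short window where the database points at a path that does not exist yet, and a failed move still leaves the row wrong (the concern raised earlier in this thread), so this only illustrates the timing, not a complete fix.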

sevein commented 4 years ago

Another, slightly related, problem is that the updateDB operation is not conditional on the rename actually succeeding. This means that if the rename fails for whatever reason, the location of the Transfer is wrong, because it is updated before the move actually happens.

You're right. transaction.atomic would roll back the transaction if rename raised an exception, but it doesn't; we should update the implementation so that an error in rename() raises.
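A minimal sketch of that direction, again with a hypothetical Transfer model as a stand-in rather than the actual implementation: if the filesystem operation raises inside the atomic block, the location update is rolled back instead of being committed.

import shutil

from django.db import transaction


def move_transfer_rollback_sketch(transfer_uuid, src, dst):
    with transaction.atomic():
        Transfer.objects.filter(uuid=transfer_uuid).update(currentlocation=dst)
        # If the move fails, letting the exception propagate (rather than only
        # logging it) makes transaction.atomic() roll back the UPDATE above, so
        # the database never records a location the transfer never reached.
        shutil.move(src, dst)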

Another thing that I noticed is that when I increase the number of MCP Clients, the number of failures of this type also increases. To me, this is an indication of a race condition between the MCP Clients. Probably something like a read-before-update race. Is anyone familiar with the update internals of Django?

You were so right! We've also found that the problem appears more frequently when using inotify, which makes sense since our callbacks for watched-directory events are fired off sooner via the inotify API.

sevein commented 4 years ago

I'm not actively working on this issue, but I thought I would try to inject an artificial delay in move_transfer to prove it, and I now see that the error happens consistently!

def call(jobs):
    with transaction.atomic():
        for job in jobs:
            with job.JobContext():
                do_work()
        time.sleep(2)  # <------------ [2s delay: the transaction is still open here]

arthytrip commented 4 years ago

I had tried putting an artificial delay in reloadFileList MCPServer, as suggested here, but absolutely nothing has improved (and perhaps I shouldn't have expected it). @sevein I have seen that you have set this problem in triage for 1.12. Tell me that's not true! Is there no possibility for 1.11? This little problem is not a joke!

sevein commented 4 years ago

Hi @arthytrip. My understanding is that the problem that you're experiencing is specific to shared filesystems and better framed in https://github.com/archivematica/Issues/issues/1124. Let's move this conversation there!