
Problem: Shared directory out of sync between ECS instances #613

Open helenst opened 5 years ago

helenst commented 5 years ago

Expected behaviour
Archivematica workflow should run error-free on sample transfers when deployed on multiple servers using a mounted network filesystem for the shared directory (specifically, in our case, MCPServer and MCPClient deployed on different ECS instances and using a shared EFS filesystem).

Current behaviour
In "Process submission documentation", the "Move metadata to objects directory" job runs immediately before "Assign file UUIDs to metadata".

For the first of those jobs, the move_or_merge MCPClient script performs move operations on the shared filesystem, including moving the directory that contains METS.xml. Once that script has completed, MCPServer moves on through the workflow and reads the filesystem to work out which files it should run assign_file_uuids on; that list should include METS.xml.

However, MCPServer doesn't always find METS.xml, so it doesn't run the assign_file_uuids script on it, and the file never gets a UUID (or a File object). This causes later failures in the workflow: see https://github.com/wellcometrust/platform/issues/3510 and https://github.com/wellcometrust/platform/issues/3511

It seems that communication between MCPClient and MCPServer is faster than the shared filesystem, so changes made to the filesystem by one instance aren't visible quickly enough to the other.
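Roughly, the interaction looks like the sketch below (a simplified illustration only; the paths, unit name and helper function are made up and are not the actual MCPServer code):

```python
import os

# Typical shared directory location; the actual path depends on the install.
SHARED_DIR = "/var/archivematica/sharedDirectory"

def files_to_process(unit_path):
    """Roughly what MCPServer does next: walk the unit on the shared
    filesystem to decide which files assign_file_uuids should run on."""
    found = []
    for root, _dirs, names in os.walk(unit_path):
        for name in names:
            found.append(os.path.join(root, name))
    return found

# MCPClient (instance A) has just moved the metadata directory, including
# METS.xml, and reported the job as finished. MCPServer (instance B) then
# immediately walks the same EFS mount:
unit = os.path.join(SHARED_DIR, "currentlyProcessing", "example-transfer")
listing = files_to_process(unit)

# If the network filesystem lags, the walk completes before the move is
# visible on instance B, METS.xml is silently missing from `listing`, and
# it never gets a UUID or a File object.
if not any(path.endswith("METS.xml") for path in listing):
    print("METS.xml not visible yet - this is the race")
```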

The same thing also happens in "Process metadata directory", with different effects (https://github.com/wellcometrust/platform/issues/3508 and https://github.com/wellcometrust/platform/issues/3491), but I believe the root cause is the same.

Testing with an artificial 5-second delay introduced before the file list is read (https://github.com/wellcometrust/archivematica/commit/f6fc67354d88a19a01ac39d2dacba38a277ca00b) stops these errors from happening.
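A fixed sleep is a blunt instrument, though; polling for the expected path with a timeout might be a less wasteful variant of the same workaround (a rough sketch only, with a made-up helper name and timeouts, not what the linked commit does):

```python
import os
import time

def wait_for_path(path, timeout=10.0, interval=0.25):
    """Poll the shared filesystem until `path` is visible or `timeout`
    seconds have passed. Returns True if the path appeared in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return os.path.exists(path)

# e.g. before MCPServer builds the file list for assign_file_uuids:
# wait_for_path(os.path.join(metadata_dir, "METS.xml"))
```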

For now, we are running both services on a single instance, which also seems to fix things. It would be good to be able to run them reliably on separate instances in future.

Steps to reproduce
Deploy Archivematica on separate ECS instances, with the shared directory on an EFS filesystem. Run transfers repeatedly through the system; these problems appear intermittently. On our deployment it was rare to see an ingest succeed.

Your environment (version of Archivematica, OS version, etc.)
On the qa/1.x branch (based on Archivematica 1.9), using the Archivematica Dockerfiles deployed to AWS with Terraform (https://github.com/wellcometrust/archivematica-infra/)



ThomasEdvardsen commented 5 years ago

Hi @helenst.

We are running both 1.8.1 and 1.9.0 with MCPClients distributed over multiple servers, and we are experiencing the same problems with "missing directories", probably due to synchronization lag on the shared filesystem.

I have filed some issues as well: #612 and #589

I've also posted a message on Google Groups.

We are running CentOS 7 on oVirt with GlusterFS as shared storage.

sevein commented 5 years ago

@helenst, I haven't found any leads yet, but I was wondering if you've had a chance to review your mount settings? Is it mounted with async or sync? I've found a couple of interesting articles from GitLab on NFS [1] [2]. It may be worth trying nfsvers=4.1 too.
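For anyone else checking this, those options go on the mount itself; an EFS/NFSv4.1 mount with synchronous writes might look roughly like the line below (the filesystem ID, region and mount path are placeholders, and whether sync is acceptable performance-wise will depend on the workload):

```
sudo mount -t nfs4 \
  -o nfsvers=4.1,sync,hard,timeo=600,retrans=2 \
  fs-12345678.efs.eu-west-1.amazonaws.com:/ \
  /var/archivematica/sharedDirectory
```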

@ThomasEdvardsen, thanks for your detailed reports. We'll investigate!

helenst commented 5 years ago

It does appear to be mounted with async, so sync may be worth a try! Thanks :)

ross-spencer commented 5 years ago

Hi @helenst, I noticed this issue while looking through the backlog this morning. Does it look like it can be closed now, or do you think there's a way to resolve it helpfully, e.g. through docs? And is it something that's still happening for you?

helenst commented 5 years ago

@ross-spencer We've been keeping the MCP client and server on the same instance, so it hasn't been a problem for us, although I guess the underlying issues are still there. It might be good to have this in the docs so others can be aware.