Open helenst opened 5 years ago
Hi @helenst.
We are running both 1.8.1 and 1.9.0 with MCPClients distributed over multiple servers. We are experiencing the same problems with "missing directories", probably due to synchronization of the shared filesystem.
I have filed some issues as well: #612 and #589
I also posted a message on Google Groups.
We are running CentOS 7 on oVirt with GlusterFS as shared storage.
@helenst, I haven't found any lead yet but I was wondering if you've had a chance to review your mount settings? Is it mounted with async or sync?
I've found a couple of interesting articles from Gitlab on NFS [1] [2]. It may be worth trying nfsvers=4.1 too.
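One way to check the mount settings programmatically is sketched below in Python. It is only an illustration: the function names and the /mnt/shared path are hypothetical, and it assumes a Linux host where /proc/mounts lists each mount as "device mount_point fstype options dump pass". On Linux, sync normally appears in the options only when it has been set; async is the default and is typically not listed.

```python
def mount_options(mount_point, mounts_text):
    """Return the set of mount options for mount_point, or None if not found.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Options are a comma-separated list, e.g. "rw,relatime,vers=4.1"
            return set(fields[3].split(","))
    return None


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's view of current mounts (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

For example, `"sync" in (mount_options("/mnt/shared", read_proc_mounts()) or set())` reports whether the hypothetical /mnt/shared mount is synchronous, and an entry such as vers=4.1 shows the negotiated NFS version.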
@ThomasEdvardsen, thanks for your detailed reports. We'll investigate!
It does appear to be mounted with async, so sync may be worth a try! Thanks :)
Hi @helenst, I noticed this issue while looking through the backlog this morning. Does it look like it can be closed now, or do you think there is a helpful way to resolve it, e.g. docs? And is it something that's still happening for you?
@ross-spencer We've been keeping MCP client and server on the same instance so it's not been a problem for us, although I guess the underlying issues are still there. Might be good to have in the docs so others can be aware.
Expected behaviour
The Archivematica workflow should run error-free on sample transfers when deployed on multiple servers using a mounted network filesystem for the shared directory (specifically, in our case, MCPServer and MCPClient deployed on different ECS instances and using a shared EFS filesystem).
Current behaviour
In "Process submission documentation", the "Move metadata to objects directory" job is immediately before "Assign file UUIDs to metadata".
For the first of those, the move_or_merge MCPClient script performs move operations on the shared filesystem, which include the directory containing METS.xml. Once that script has completed, MCPServer moves on through the workflow and reads the filesystem to figure out which files it should run assign_file_uuids on - this should include METS.xml.
However, MCPServer doesn't always find METS.xml, so it doesn't run the assign_file_uuids script on it and it doesn't get a UUID (or a File object), which causes later failures in the workflow: see https://github.com/wellcometrust/platform/issues/3510 and https://github.com/wellcometrust/platform/issues/3511
It seems that the communication between MCPClient/MCPServer is faster than the shared filesystem, so changes in the filesystem made by one instance aren't seen quickly enough by another.
The same also happens in "Process metadata directory" with different effects (https://github.com/wellcometrust/platform/issues/3508 and https://github.com/wellcometrust/platform/issues/3491) but I believe the root cause is the same.
Testing with an artificial 5 second delay introduced before the file list is read (https://github.com/wellcometrust/archivematica/commit/f6fc67354d88a19a01ac39d2dacba38a277ca00b) stops these errors from happening.
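As an alternative to a fixed delay, a bounded poll could wait only as long as needed. The Python sketch below is not the code from the linked commit. `wait_for_paths` is a hypothetical helper, and it assumes the caller already knows which paths to expect (for example, because the MCPClient job reported them), which is an assumption not stated in this issue. Polling also does not bypass any client-side caching done by the network filesystem, so it narrows the race rather than removing it.

```python
import os
import time


def wait_for_paths(paths, timeout=5.0, interval=0.1):
    """Poll until every path exists or `timeout` seconds have elapsed.

    Returns the list of paths that are still missing; an empty list
    means every expected path became visible in time.
    """
    deadline = time.monotonic() + timeout
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(interval)
        # Only re-check the paths that have not been seen yet.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

A caller could then read the file list only once `wait_for_paths([...])` returns an empty list, and fail the job with a clear error otherwise, instead of silently skipping files such as METS.xml.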
For now, we are running both services in a single instance and this also seems to fix things. It would be good to be able to run them reliably on separate instances in future.
Steps to reproduce
Deploy Archivematica on separate ECS instances, with shared directory on an EFS filesystem. Run transfers repeatedly through the system - these problems are seen intermittently. On our deployment, it was rare to see an ingest succeed.
Your environment (version of Archivematica, OS version, etc)
On the qa/1.x branch (based on Archivematica 1.9), using Archivematica Dockerfiles deployed to AWS with Terraform (https://github.com/wellcometrust/archivematica-infra/)