archivematica / Issues

Issues repository for the Archivematica project

Problem: encountering "OSError: Device or resource busy" #1124

Status: Open. jorikvankemenade opened this issue 4 years ago

jorikvankemenade commented 4 years ago

Expected behaviour: When a task or job finishes, all system resources are released.

Current behaviour: During some performance/load tests I occasionally ran into the following problem:

[screenshot: task traceback ending in "OSError: Device or resource busy"]

The Archivematica task tries to use a resource that is still in use. This typically happens when a new task uses resources that have not yet been released by a previous task, which could indicate that the MCP Server dispatches new tasks before the old tasks have completely finished.
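If the error can be caught while it is happening, something like the sketch below should show which process still holds the path that a task reports as busy (the path is only a placeholder for the one in the traceback):

# Substitute the path reported in the "Device or resource busy" traceback.
sudo lsof +D /var/archivematica/sharedDirectory/tmp
# Alternatively, list the processes using the filesystem that contains that path.
sudo fuser -vm /var/archivematica/sharedDirectory/tmp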

Steps to reproduce: I don't have a reliable way to reproduce this yet. I can "trigger" the problem by running many small transfers across many MCP Clients. A typical setup where I would see it was transferring ~50 photo albums on a cluster with 4 MCP Client nodes, each node running 3 or 4 clients. During the transfer of the 50 albums, this error would typically show up 0-4 times.

Your environment (version of Archivematica, operating system, other relevant details): Archivematica qa/1.x. The MCP and SS databases are hosted on MySQL 5. The shared directory is hosted on a CephFS share. The remaining Archivematica components are each deployed on their own VM.



sevein commented 4 years ago

The shared directory is hosted on a CephFS share.

@jorikvankemenade, have you been able to reproduce this one without CephFS?

jorikvankemenade commented 4 years ago

I don't have an environment available for that. The single-node instances I ran in the past could only run ~2-3 MCP Clients reliably before suffering too much from overutilization. I think this problem only shows up in systems where the next task can be claimed and started very quickly, but that is just a hunch. I can see whether I can reproduce it locally in a Docker setup, but I don't have very high hopes for that.
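For the Docker attempt, the rough idea would be to scale the MCPClient service in the compose-based development environment, something like the sketch below (the service name archivematica-mcp-client is an assumption about that setup):

# Run several MCPClient replicas against one MCPServer to recreate the contention.
docker-compose up -d --scale archivematica-mcp-client=4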

jorikvankemenade commented 4 years ago

@sevein if you check the new entries in #1111 you will see that another user had similar problems using an NFS-mounted shared directory. So the problem is not Ceph-related; it seems to be correlated with "scaling out" Archivematica.

sevein commented 4 years ago

@sevein I have seen that you have put this problem in triage for 1.12. Tell me that's not true! Is there no possibility of fixing it in 1.11? This problem is no small matter!

We have a short-term solution for #1111 (see https://github.com/artefactual/archivematica/pull/1587). @jorikvankemenade @arthytrip, I'd be curious to see whether that also fixes the issue described here (device or resource busy).
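For anyone running from a source checkout rather than packages, a minimal sketch of trying that PR locally (the local branch name pr-1587 is arbitrary):

# Fetch the pull request head from GitHub into a local branch and switch to it.
git fetch origin pull/1587/head:pr-1587
git checkout pr-1587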

If https://github.com/artefactual/archivematica/pull/1587 doesn't cut it then, as much as I'd like to see this fixed promptly, we'd still need to understand what's happening. A fix wouldn't be possible in v1.11 because we've already started final testing. triage-release-1.12 is an umbrella for the next release, but that might as well be a patch release happening sooner.

I'd like to encourage you to continue your investigations and report your findings! Have you tried tweaking the NFS mount options (see https://github.com/archivematica/Issues/issues/613#issuecomment-480352466)? Would that help?
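Purely as an illustration of the kind of tweaking meant here, one could reduce NFS client caching when mounting the shared directory; the options, server name, and paths below are placeholders, not necessarily what the linked comment recommends:

# Remount the shared directory with attribute and directory-lookup caching disabled.
sudo mount -t nfs -o rw,hard,vers=4.1,actimeo=0,lookupcache=none \
    nfs-server:/export/archivematica /var/archivematica/sharedDirectory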

arthytrip commented 4 years ago

All my tests from last weekend point to a problem of scaling and contention between MCPClient instances, as already stated by Jorik. It is a bit difficult to summarize all the tests made, but in my environment one fact emerges: the "Device or resource busy" and "No such file or directory" errors always occur when multiple MCPClient instances run on the same host.

In the latest tests I distributed the MCPClients across 4 hosts and varied the number of CPUs per host from 2 to 8.

All tests with a single MCPClient instance on each host were successful, regardless of the number of hosts employed (with one strange exception: a Transfer literally disappeared from the Dashboard interface; it was never performed and left traces of itself only in sharedDirectory/tmp and sharedDirectory/watchedDirectories/activeTransfers).

All tests in which I activated more than one MCPClient instance on the same host showed the problems indicated above.

Contention between MCPClient instances on a single host is enough to trigger the problem, even when the host's computational resources are not stressed at all.

Initially I had detected an "error opening dir" problem that strangely affected a few subdirectories in the sharedDirectory when they were read by other hosts (and in my opinion there may also be a problem of inexplicably variable permissions within the sharedDirectory). However, once I fixed this with the NFS export settings (sync,no_wdelay,hide,sec=sys,rw,secure,no_root_squash,no_all_squash), it no longer happened, and now I see failures only with multiple MCPClient instances, for the same batches of packages.
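For reference, a sketch of what that export might look like in /etc/exports on the NFS server (the export path and client subnet are placeholders for my actual values):

# /etc/exports entry using the options above; run "exportfs -ra" after editing.
/var/archivematica/sharedDirectory 10.0.0.0/24(sync,no_wdelay,hide,sec=sys,rw,secure,no_root_squash,no_all_squash)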

I also tried to insert a few seconds of delay at some of the points suggested by Jesus and others (create_sip_from_transfer_objects.py at line 127, and in createUnitAndJobChain in archivematicaMCP.py at line 52), but it did not help.

Unfortunately I'm not in a position to install the development branch for tests, and I don't use the Ansible role. If there are targeted tests I can do, I'm happy to make myself available.

mamedin commented 4 years ago

Hi @arthytrip,

The easiest way to test AM qa/1.x from packages is to use the installation script from the archivematica-docs repo. You only need to change the AM rpm repos. This script is the basis of the https://www.archivematica.org/en/docs/archivematica-1.10/admin-manual/installation-setup/installation/install-centos/ doc (except post-install configuration steps 3-5).

1) Download the AM installation script:

sudo yum -y update
sudo yum -y install wget
wget https://raw.githubusercontent.com/artefactual/archivematica-docs/1.11/admin-manual/installation-setup/installation/scripts/am-centos-rpm.sh

2) Change rpm repositories in script:

Change am-centos-rpm.sh file from:

sudo -u root bash -c 'cat << EOF > /etc/yum.repos.d/archivematica.repo
[archivematica]
name=archivematica
baseurl=https://packages.archivematica.org/1.10.x/centos
gpgcheck=1
gpgkey=https://packages.archivematica.org/1.10.x/key.asc
enabled=1
EOF'

sudo -u root bash -c 'cat << EOF > /etc/yum.repos.d/archivematica-extras.repo
[archivematica-extras]
name=archivematica-extras
baseurl=https://packages.archivematica.org/1.10.x/centos-extras
gpgcheck=1
gpgkey=https://packages.archivematica.org/1.10.x/key.asc
enabled=1
EOF'

To:

sudo -u root bash -c 'cat << EOF > /etc/yum.repos.d/archivematica.repo
[archivematica]
name=archivematica
baseurl=https://jenkins-ci.archivematica.org/repos/am-packbuild/1.11.0-beta/centos7
gpgcheck=0
gpgkey=""
enabled=1
EOF'

sudo -u root bash -c 'cat << EOF > /etc/yum.repos.d/archivematica-extras.repo
[archivematica-extras]
name=archivematica-extras
baseurl=https://packages.archivematica.org/1.11.x/centos-extras
gpgcheck=1
gpgkey=https://packages.archivematica.org/1.11.x/key.asc
enabled=1
EOF'

3) Run script:

chmod +x am-centos-rpm.sh
sudo ./am-centos-rpm.sh

4) Check packages:

[centos@mamedin-test-centos ~]$ sudo rpm -qa | grep -i archivematica
archivematica-common-1.11.0~beta.1-9.x86_64
archivematica-dashboard-1.11.0~beta.1-9.x86_64
archivematica-mcp-client-1.11.0~beta.1-9.x86_64
archivematica-storage-service-0.16.0~beta.1-9.x86_64
archivematica-mcp-server-1.11.0~beta.1-9.x86_64

To upgrade to the new packages (I created new packages this morning), just clean the yum cache and upgrade the AM packages:

sudo yum clean all -y
sudo yum upgrade archivematica* -y

And now my test VM is using beta.1-10 packages:

[centos@mamedin-test-centos ~]$ sudo rpm -qa | grep -i archivematica
archivematica-mcp-server-1.11.0~beta.1-10.x86_64
archivematica-common-1.11.0~beta.1-10.x86_64
archivematica-dashboard-1.11.0~beta.1-10.x86_64
archivematica-storage-service-0.16.0~beta.1-10.x86_64
archivematica-mcp-client-1.11.0~beta.1-10.x86_64

5) Configure AM:

The SS username and password are both test (configured in the script):

https://github.com/artefactual/archivematica-docs/blob/1.11/admin-manual/installation-setup/installation/scripts/am-centos-rpm.sh#L75-L76

The SS API key was generated automatically, and that key will connect the Archivematica pipeline to the Storage Service API. The API key can be found via the SS web interface (go to Administration > Users).
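To confirm that the key works before configuring the dashboard, a quick check along these lines should return JSON (this assumes the usual Storage Service API key header format and the local port 8001; replace YOUR_API_KEY with the value shown under Administration > Users):

curl -s -H "Authorization: ApiKey test:YOUR_API_KEY" http://127.0.0.1:8001/api/v2/location/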

To finish the installation, use your web browser to navigate to the Archivematica dashboard using the IP address of the machine on which you have been installing, e.g., http://<machine-ip>:81 (or http://localhost:81 or http://127.0.0.1:81 if this is a local development setup).

At the Welcome page, create an administrative user for the Archivematica pipeline by entering the organization name, the organization identifier, username, email, and password.

On the next screen, connect your pipeline to the Storage Service by entering the Storage Service URL and username, and by pasting in the API key retrieved from the Storage Service interface in the previous step.

If the Storage Service and the Archivematica dashboard are installed on the same machine, then you should supply http://127.0.0.1:8001 as the Storage Service URL at this screen. If the Storage Service and the Archivematica dashboard are installed on different nodes (servers), then you should use the IP address or fully qualified domain name of your Storage Service instance, e.g., http://<storage-service-ip>:8001, and you must ensure that any firewall rules (e.g., iptables, ufw, AWS security groups) are configured to allow requests from your dashboard IP to your Storage Service IP on the appropriate port.
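On CentOS with firewalld, for example, opening the Storage Service port to the dashboard host might look like the following (adjust the port or add a source restriction to match your setup):

sudo firewall-cmd --permanent --add-port=8001/tcp
sudo firewall-cmd --reload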

See: https://www.archivematica.org/en/docs/archivematica-1.10/admin-manual/installation-setup/installation/install-centos/#post-install-configuration

arthytrip commented 4 years ago

Hi @mamedin, you are suggesting that I create a new installation, from scratch, using all the hosts! Right now I don't have the chance, but I will keep your suggestions in mind, especially the initial part on pointing to the package sources. My current purpose is to design a scaled-out system that guarantees a certain performance (throughput). If I reused the same hosts to install a version under development, I could not then propose the solution for a production environment. Making changes to the sources in production, if I must, I can do very carefully and consider a temporary patch, but I cannot install a development version in a production environment. I will talk about it with my colleagues, but I think I will, unfortunately, have to wait for version 1.11, if not 1.12, which, I confess, puts me in real trouble...

mamedin commented 4 years ago

@arthytrip please don't upgrade production servers; I was thinking of a test on a dev server.

arthytrip commented 4 years ago

It is a test system, but aimed at a production landscape. Don't worry: when I said "I will talk about it with my colleagues..." I was not clear; I was referring to the opportunity to also have a development system.

arthytrip commented 4 years ago

I just ran an impromptu test on a single 320 MB PDF file. I hadn't even planned it; I had to do it for other reasons.

Case 1: 4 dedicated hosts with 1 MCPClient instance each
Result: Failed at Normalize / Move to processing directory with "OSError: [Errno 2] No such file or directory: ..."

Case 2: 1 dedicated host with 1 MCPClient instance
Result: Success

I don't know what to think...