archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: MCPClient scripts can freeze server #941

Open jorikvankemenade opened 4 years ago

jorikvankemenade commented 4 years ago

Please describe the problem you'd like to be solved Even when running heavy loads on an Archivematica instance the storage service and dashboard should be able to respond to incoming requests.

Under a heavy load on the system, e.g. multiple MCP Clients running the Compress AIP task, a lot of 7zip threads are spawned that block other threads on the system. In several cases this caused a time-out from the storage service when, for example, requesting the CP location. I think that core, latency-sensitive components of Archivematica should have priority over the client processes/microservices.

Describe the solution you'd like to see implemented I have run some small experiments in which I set the Archivematica MCP Client(s) on my system to a lower priority using the Nice= setting in the systemd service unit. This way I could overutilise my system while still being able to use the dashboard and so on.

For reference, I used this systemd file:

[Unit]
Description=Archivematica MCP Client Service
After=syslog.target network.target

[Service]
Type=simple
User=archivematica
EnvironmentFile=/etc/sysconfig/archivematica-mcp-client
Environment=PATH=/usr/share/archivematica/virtualenvs/archivematica-mcp-client/bin/
ExecStart=/usr/share/archivematica/virtualenvs/archivematica-mcp-client/bin/python /usr/lib/archivematica/MCPClient/archivematicaClient.py
Nice=10

[Install]
WantedBy=multi-user.target

Nice takes a value between -20 and 19, where -20 is the highest priority and 19 the lowest; by default, user processes run at 0. More information can be found in the systemd documentation.
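As a quick, Archivematica-independent illustration of the semantics: the `nice` utility run with no command prints the niceness of the current process, so you can confirm what a child process inherits:

```shell
# Launch a child at niceness 10; the inner `nice` (no arguments)
# prints the niceness it inherited from the outer invocation.
nice -n 10 nice
```

On a default shell (niceness 0) this prints `10`, matching what the `Nice=10` unit setting would give the MCP Client.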

Describe alternatives you've considered Since one of the root causes seems to be the heavily multithreaded "Compress AIP" and "Index AIP" jobs, I could limit the number of MCP Clients allowed to run those jobs. However, this adds complexity to the configuration, and we might have to create more exceptions in the future.

This is just the result of some small non-exhaustive testing and options. I am open to discussing other options and hearing more opinions.

Additional context My testing was done on an 8-core/16 GB CentOS 7 VM with Archivematica 1.10 and SS 0.15.



mamedin commented 4 years ago

Thanks @jorikvankemenade

That's a nice improvement!

mamedin commented 4 years ago

As an alternative, it can be done with an override.conf file by running:

sudo systemctl edit archivematica-mcp-client

Adding the lines:

[Service]
Nice=10

And reloading systemd units:

sudo systemctl daemon-reload

It creates the override file:

/etc/systemd/system/archivematica-mcp-client.service.d/override.conf

This way the change persists across AM upgrades.
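To confirm the override is picked up (a sketch; the unit name assumes a stock package install), you can query the Nice property systemd has loaded for the unit after the daemon-reload:

```shell
# Show the Nice value systemd will apply to the unit's processes.
systemctl show -p Nice archivematica-mcp-client
```

This should report `Nice=10` once the drop-in is loaded; running processes only pick it up after a service restart.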

cole commented 4 years ago

:+1: for nice!

@sevein may have some input here as I think this has been discussed in the past but we never documented it anywhere.

There are also some more systemd options out there, like CPUShares etc., and even assigning dedicated CPUs per process.
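For reference, an illustrative (untested) sketch of what those cgroup-based options could look like as a drop-in for the MCP Client unit; the values are placeholders to tune, and CPUShares= is the older attribute name (newer systemd with unified cgroups prefers CPUWeight=):

```ini
[Service]
# Relative share of CPU time under contention (default 1024);
# lower deprioritises the MCP Client against dashboard/SS units.
CPUShares=512
# Hard cap: at most four cores' worth of CPU time for this unit.
CPUQuota=400%
# Pin worker processes to specific CPUs (here the first six),
# leaving the rest free for latency-sensitive services.
CPUAffinity=0-5
```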

@mamedin I am concerned though with nicing MCPServer & SS above 0, and dashboard negative — we already know that simply viewing the dashboard can really slow down processing, and SS can sometimes become non-responsive.

mamedin commented 4 years ago

@cole we are going to test with AMAUAT:

https://github.com/artefactual/deploy-pub/blob/dev/jenkins-vagrant-updates-systemd-nice/playbooks/archivematica-bionic/vars-singlenode-qa.yml#L15

archivematica_src_dashboard_systemd_nice: -10
archivematica_src_mcp_server_systemd_nice: 5
archivematica_src_mcp_client_systemd_nice: 10
archivematica_src_storage_service_systemd_nice: 1

We can try other values later.

jorikvankemenade commented 4 years ago

Thanks for the extra information. So far I have mainly had trouble with the Storage Service timing out, but I tested with the SS at 0 (only the MCP Client was at 10, the rest at the default 0). So I am looking forward to @mamedin's results.