archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: incompatible units of code cause intermittent errors in Storage Service #730

Open sevein opened 5 years ago

sevein commented 5 years ago

Expected behaviour

Storage Service has to process a large volume of tasks concurrently. We expect the tasks to cooperate (under a cooperative multitasking scheme) or be executed concurrently by other means without causing errors or intermittent slowness.

Current behaviour

Storage Service uses a Python library called gevent to achieve scalability through asynchronous IO and lightweight multi-threading (greenlets). The scheduling scheme expects all code to cooperate, i.e. it needs to yield control to the scheduler. Some of our tasks do not cooperate, e.g. code that relies on lxml is not cooperative. As a result, users may experience slowness or unresponsiveness in the application.

We've also experienced a problem in our async task manager, which runs a loop that misbehaves under these circumstances: when non-cooperative code blocks for long periods of time, the manager expires and removes async tasks, causing a variety of errors (see #257 and #425).

Additional context

Storage Service did not have support for deferred tasks until SS 0.12. All tasks, including heavy IO operations, were executed synchronously. As a result, we noticed that our Gunicorn workers were frequently busy (e.g. sending a large file to a client) and the application would become unresponsive once we ran out of free workers. It's always possible to provision more workers, but we started looking at better ways of scaling.

We basically needed a job queue but we did not have the capacity to refactor the application to work that way. In SS 0.11, we decided to set the default Gunicorn worker class to gevent, where each worker runs an event loop (AM_GUNICORN_WORKER_CLASS=gevent). The gevent worker class creates a pool of greenlets that run inside the same OS process - they're scheduled cooperatively, i.e. only one greenlet is running at a given time, but the scheduler switches contexts (moves on to the next greenlet) when a task is busy doing IO work. The standard library is monkey patched so all Python code becomes gevent-friendly (aka cooperative).
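To make the cooperative scheduling model concrete, here is a minimal sketch (not Storage Service code; the functions are hypothetical stand-ins): a monkey-patched blocking call yields to the scheduler, while work that stays inside a tight loop or a C extension does not.

```python
# Minimal sketch of cooperative vs. non-cooperative code under gevent.
# These functions are illustrative stand-ins, not Storage Service code.
from gevent import monkey
monkey.patch_all()  # patch the standard library so blocking calls yield

import time
import gevent

def cooperative():
    # time.sleep is monkey patched: this greenlet yields to the hub and
    # other greenlets keep running while it "sleeps".
    time.sleep(1)

def non_cooperative():
    # A tight loop stands in for a long call into a C extension (e.g. lxml
    # parsing a large document): it never yields, so every other greenlet
    # in the same worker is stalled until it returns.
    deadline = time.monotonic() + 1
    while time.monotonic() < deadline:
        pass

gevent.joinall([gevent.spawn(cooperative), gevent.spawn(non_cooperative)])
```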

With the change to gevent we saw that a single worker could handle most of the usual application load. However, gevent workers may sporadically become blocked by non-cooperative code. E.g.: we've found that this is the case when our code relies on modules that make use of C-extensions, like lxml.

In SS 0.12, we introduced interim support for deferred tasks, which is still present and has been used for a few releases now. In order to avoid adding more complexity to our deployments, we added the async manager to the Gunicorn worker (as opposed to running it out of process, as you would usually do with Celery), i.e. each worker runs a copy of AsyncManager and deferred jobs are executed as threads (or greenlets, if using gevent) within each available worker.

Every task that takes long enough to run should be deferred to the task manager, but so far we've only added a few code paths that leverage this new mechanism. Under these circumstances, when the gevent worker class is deployed, we've found that non-cooperative code prevents the async manager from working properly - tasks seem to be considered expired and are deleted (see https://github.com/archivematica/Issues/issues/425), likely because async_manager.TASK_TIMEOUT_SECONDS is being exceeded.
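As a rough illustration of that failure mode (this is not the actual AsyncManager code; the names and the timeout value are hypothetical stand-ins), a timeout-based expiry check misfires like this when non-cooperative code keeps the worker busy:

```python
# Simplified illustration of the failure mode, not the real AsyncManager
# implementation; the names and the timeout value are hypothetical.
import time

TASK_TIMEOUT_SECONDS = 300  # hypothetical stand-in for the real setting

class DeferredTask:
    def __init__(self):
        self.started_at = time.monotonic()
        self.done = False

def reap_expired(tasks):
    # If a non-cooperative call keeps a task (or the whole event loop) busy
    # for longer than the timeout, tasks that are still running look expired
    # and get dropped; the client later sees this as a 404.
    now = time.monotonic()
    return [t for t in tasks
            if t.done or now - t.started_at <= TASK_TIMEOUT_SECONDS]
```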

On the Archivematica side, this manifests as follows: when storing an AIP, which may take a long time and is a code path in Storage Service known to contain non-cooperative code, the SS client polls for the status of the deferred task until it completes. When Storage Service mistakenly deletes the task as expired, the API returns a 404 error and Archivematica gives up. This error is not recoverable and retrying will not help.

Potential short-term solutions (Archivematica 1.10+)

In the last two cases, we'll likely see "Database is locked" errors when using SQLite (for more see https://docs.djangoproject.com/en/2.2/ref/databases/#database-is-locked-errors). The user could switch to MySQL or increase the timeout database option.
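For reference, the "increase the timeout database option" workaround is a Django settings change along these lines (the database path is illustrative; SQLite's default lock timeout is 5 seconds):

```python
# Sketch of raising the SQLite lock timeout in the Django settings module.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "/var/archivematica/storage-service/storage.db",  # illustrative path
        "OPTIONS": {
            # seconds a connection waits on a locked database before raising
            # "database is locked" (passed through to sqlite3.connect)
            "timeout": 20,
        },
    }
}
```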

We’ll need a more reliable solution in the long term (Archivematica 1.11+?), e.g. a disk queue, out of process workers, etc… We’ve used Celery before and we believe it’s a good solution to the problem.

Your environment (version of Archivematica, OS version, etc)

Archivematica 1.8 or newer



sromkey commented 4 years ago

@sevein is this something you think we should resolve in 0.16? Do you want to size it?

sevein commented 4 years ago

The issue can be mitigated by deploying more workers (relates to https://github.com/archivematica/Issues/issues/944 and https://github.com/archivematica/Issues/issues/952). It could be a good start for v0.16, with some extra comments in the scaling docs. A day or two, depending on how much testing we want to do, etc...
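For illustration, the mitigation amounts to a Gunicorn configuration along these lines (the values are examples only; the exact settings and environment variables used by SS deployments are covered in the scaling docs):

```python
# Example gunicorn.conf.py for the "more workers" mitigation; values are
# illustrative, not the defaults shipped with Storage Service.
workers = 4              # several worker processes, so one blocked gevent
                         # event loop no longer stalls the whole service
worker_class = "gevent"  # each worker still runs a cooperative greenlet pool
```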

A long-term solution such as adopting Celery is a major task that would probably take weeks of work.

sevein commented 4 years ago

Our scaling docs already describe how to deploy multiple Gunicorn workers to guarantee more responsiveness, which is our recommended workaround. I've submitted https://github.com/artefactual/archivematica-docs/pull/347 to clarify how to deal with potential errors when combining multiple workers in SS when the SQLite database engine is used.

We can revisit this issue in future releases, e.g. refactor non-cooperative code, create thread pools, etc...

sallain commented 4 years ago

I've added the request discussion label because there are many paths we could take to address this issue. We need to come up with some options and then evaluate for the next release.

scollazo commented 4 years ago

As @sevein says, I have been able to export a test SS SQLite database and import it into MySQL using the manage.py dumpdata / manage.py loaddata commands.
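A hedged sketch of that migration, driven through Django's management API rather than the manage.py CLI (the fixture file name is illustrative, and DJANGO_SETTINGS_MODULE must point at the Storage Service settings):

```python
# Sketch of the SQLite-to-MySQL migration described above, using Django's
# call_command instead of the manage.py CLI. The fixture name is illustrative.
import django
from django.core.management import call_command

django.setup()  # requires DJANGO_SETTINGS_MODULE to point at the SS settings

# 1) With the SQLite database still configured, dump the data to a fixture.
#    Excluding contenttypes is a common precaution when moving databases.
call_command("dumpdata", "--natural-foreign", "--exclude=contenttypes",
             "--output=storage_service_dump.json")

# 2) Point DATABASES at MySQL, run the migrations to create the schema,
#    then load the fixture into the new database:
# call_command("migrate")
# call_command("loaddata", "storage_service_dump.json")
```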

sromkey commented 4 years ago

Too large in scope for 1.12/0.17 also. We'll need to keep discussing!