archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: incompatible units of code cause intermittent errors in Storage Service #730

Open sevein opened 5 years ago

sevein commented 5 years ago

Expected behaviour

Storage Service has to process a large volume of tasks concurrently. We expect the tasks to cooperate (under a cooperative multitasking scheme) or be executed concurrently by other means without causing errors or intermittent slowness.

Current behaviour

Storage Service uses a Python library called gevent to achieve scalability through asynchronous IO and lightweight multi-threading (greenlets). The scheduling scheme expects all code to cooperate, i.e. it needs to yield control to the scheduler. Some of our tasks do not cooperate, e.g. code that relies on lxml is not cooperative. As a result, users may experience slowness or unresponsiveness in the application.

We've also experienced a problem in our async task manager, which runs a loop that misbehaves under these circumstances: when non-cooperative code blocks for long periods of time, the manager expires and removes async tasks, causing a variety of errors (see #257 and #425).

Additional context

Storage Service did not have support for deferred tasks until SS 0.12. All tasks, including heavy IO operations, were executed synchronously. As a result, we noticed that our Gunicorn workers were frequently busy (e.g. sending a large file to a client) and the application would become unresponsive once we ran out of free workers. It's always possible to provision more workers, but we started looking at better ways of scaling.

We basically needed a job queue but we did not have the capacity to refactor the application to work that way. In SS 0.11, we decided to set the default Gunicorn worker class to gevent, where each worker runs an event loop (AM_GUNICORN_WORKER_CLASS=gevent). The gevent worker class creates a pool of greenlets that run inside the same OS process - they're scheduled cooperatively, i.e. only one greenlet is running at a given time, but the scheduler switches contexts (moves on to the next greenlet) when a task is busy doing IO work. The standard library is monkey patched so all Python code becomes gevent-friendly (aka cooperative).
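To make the cooperative scheduling model concrete, here is a minimal sketch (not Storage Service code; the functions are hypothetical stand-ins): a monkey-patched blocking call yields to the scheduler, while work that stays inside a tight loop or a C extension does not.

```python
# Minimal sketch of cooperative vs. non-cooperative code under gevent.
# These functions are illustrative stand-ins, not Storage Service code.
from gevent import monkey
monkey.patch_all()  # patch the standard library so blocking calls yield

import time
import gevent

def cooperative():
    # time.sleep is monkey patched: this greenlet yields to the hub and
    # other greenlets keep running while it "sleeps".
    time.sleep(1)

def non_cooperative():
    # A tight loop stands in for a long call into a C extension (e.g. lxml
    # parsing a large document): it never yields, so every other greenlet
    # in the same worker is stalled until it returns.
    deadline = time.monotonic() + 1
    while time.monotonic() < deadline:
        pass

gevent.joinall([gevent.spawn(cooperative), gevent.spawn(non_cooperative)])
```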

With the change to gevent we saw that a single worker could handle most of the usual application load. However, gevent workers may sporadically become blocked by non-cooperative code. E.g.: we've found that this is the case when our code relies on modules that make use of C-extensions, like lxml.

In SS 0.12, we introduced interim support for deferred tasks, which is still present and has been used for a few releases now. In order to avoid adding more complexity to our deployments, we added the async manager to the Gunicorn worker (as opposed to running it out of process, as you would usually do with Celery), i.e. each worker runs a copy of AsyncManager and deferred jobs are executed as threads (or greenlets, if using gevent) within each available worker.

Every task that takes long enough to run should be deferred to the task manager, but so far we've only added a few code paths that leverage this new mechanism. Under these circumstances, when the gevent worker class is deployed, we've found that non-cooperative code prevents the async manager from working properly - tasks seem to be considered expired and are deleted (see https://github.com/archivematica/Issues/issues/425), likely because async_manager.TASK_TIMEOUT_SECONDS is being exceeded.
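As a rough illustration of that failure mode (this is not the actual AsyncManager code; the names and the timeout value are hypothetical stand-ins), a timeout-based expiry check misfires like this when non-cooperative code keeps the worker busy:

```python
# Simplified illustration of the failure mode, not the real AsyncManager
# implementation; the names and the timeout value are hypothetical.
import time

TASK_TIMEOUT_SECONDS = 300  # hypothetical stand-in for the real setting

class DeferredTask:
    def __init__(self):
        self.started_at = time.monotonic()
        self.done = False

def reap_expired(tasks):
    # If a non-cooperative call keeps a task (or the whole event loop) busy
    # for longer than the timeout, tasks that are still running look expired
    # and get dropped; the client later sees this as a 404.
    now = time.monotonic()
    return [t for t in tasks
            if t.done or now - t.started_at <= TASK_TIMEOUT_SECONDS]
```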

On the Archivematica side, this manifests as follows: when storing an AIP, which may take a long time and is a code path in Storage Service known to contain non-cooperative code, the SS client polls for the status of the deferred task until it completes. When Storage Service mistakenly deletes the task as expired, the API returns a 404 error and Archivematica gives up. This error is not recoverable and retrying will not help.

Potential short-term solutions (Archivematica 1.10+)

In the last two cases, we'll likely see "Database is locked" errors when using SQLite (for more see https://docs.djangoproject.com/en/2.2/ref/databases/#database-is-locked-errors). The user could switch to MySQL or increase the timeout database option.
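For reference, the "increase the timeout database option" workaround is a Django settings change along these lines (the database path is illustrative; SQLite's default lock timeout is 5 seconds):

```python
# Sketch of raising the SQLite lock timeout in the Django settings module.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "/var/archivematica/storage-service/storage.db",  # illustrative path
        "OPTIONS": {
            # seconds a connection waits on a locked database before raising
            # "database is locked" (passed through to sqlite3.connect)
            "timeout": 20,
        },
    }
}
```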

We’ll need a more reliable solution in the long term (Archivematica 1.11+?), e.g. a disk queue, out of process workers, etc… We’ve used Celery before and we believe it’s a good solution to the problem.

Your environment (version of Archivematica, OS version, etc)

Archivematica 1.8 or newer



sromkey commented 4 years ago

@sevein is this something you think we should resolve in 0.16? Do you want to size it?

sevein commented 4 years ago

The issue can be mitigated by deploying more workers (relates to https://github.com/archivematica/Issues/issues/944 and https://github.com/archivematica/Issues/issues/952). It could be a good start for v0.16, with some extra comments in the scaling docs. A day or two, depending on how much testing we want to do, etc...
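For illustration, the mitigation amounts to a Gunicorn configuration along these lines (the values are examples only; the exact settings and environment variables used by SS deployments are covered in the scaling docs):

```python
# Example gunicorn.conf.py for the "more workers" mitigation; values are
# illustrative, not the defaults shipped with Storage Service.
workers = 4              # several worker processes, so one blocked gevent
                         # event loop no longer stalls the whole service
worker_class = "gevent"  # each worker still runs a cooperative greenlet pool
```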

A long-term solution such as adopting Celery is a major task that would probably take weeks of work.

sevein commented 4 years ago

Our scaling docs already describe how to deploy multiple Gunicorn workers to guarantee more responsiveness, which is our recommended workaround. I've submitted https://github.com/artefactual/archivematica-docs/pull/347 to clarify how to deal with potential errors when combining multiple workers in SS when the SQLite database engine is used.

We can revisit this issue in future releases, e.g. refactor non-cooperative code, create thread pools, etc...

sallain commented 4 years ago

I've added the request discussion label because there are many paths we could take to address this issue. We need to come up with some options and then evaluate for the next release.

scollazo commented 4 years ago

As @sevein says, I have been able to export a test SS SQLite database and import it into MySQL using the manage.py dumpdata / manage.py loaddata commands.
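A hedged sketch of that migration, driven through Django's management API rather than the manage.py CLI (the fixture file name is illustrative, and DJANGO_SETTINGS_MODULE must point at the Storage Service settings):

```python
# Sketch of the SQLite-to-MySQL migration described above, using Django's
# call_command instead of the manage.py CLI. The fixture name is illustrative.
import django
from django.core.management import call_command

django.setup()  # requires DJANGO_SETTINGS_MODULE to point at the SS settings

# 1) With the SQLite database still configured, dump the data to a fixture.
#    Excluding contenttypes is a common precaution when moving databases.
call_command("dumpdata", "--natural-foreign", "--exclude=contenttypes",
             "--output=storage_service_dump.json")

# 2) Point DATABASES at MySQL, run the migrations to create the schema,
#    then load the fixture into the new database:
# call_command("migrate")
# call_command("loaddata", "storage_service_dump.json")
```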

sromkey commented 4 years ago

Too large in scope for 1.12/0.17 also. We'll need to keep discussing!