archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0
16 stars 1 forks source link

Problem: Heavy MySQL load with large Tasks table #1394

Open mamedin opened 6 years ago

mamedin commented 6 years ago

On Archivematica systems, the MySQL Tasks table can grow very big (several million records). When the MCP service is restarted the records with the null exitCode field are updated with the following query:

UPDATE Tasks SET stdError='MCP shut down while processing.', exitCode=-1 WHERE exitCode='NULL';

This update query may take a long time (several minutes) and the dashboard could be unresponsive meanwhile.

mamedin commented 6 years ago

The table Tasks has two index for fields: taskUUID and jobuuid.

The exitCode field could be indexed.

jhsimpson commented 6 years ago

@mamedin I have been doing some testing on this - the exitCode column is defined as a bigint, but it doesn't really need to be.

I tested on a system with a Tasks table containing 18 million rows. The query you posted (something quite close at least) took just under 3 minutes to run from a mysql command prompt.

I tried 2 changes

1) alter the table to change exitCode from a bigint to an int (took about 15 minutes to run) I ran alter table Tasks modify exitCode int; (with Archivematica processes halted).

2) add an index to exitCode (took about 3 1/2 minutes to run) I ran alter table Tasks add index idx_Tasks_exitCode (exitCode); After this the update query took less than a second.

I am going to see if I can create a django migration to apply this change.
I haven't tested to see if there is a performance impact on writing to the Tasks table. I assume this will not be large, but I'm not sure.

cole commented 4 years ago

This explains why my mcpserver takes a minute to start up (500k task rows). The exitCode column is kind of low cardinality for an index (the only values I see used are NULL,-1,179,1,2, and 255), but it sounds like it works.

ross-spencer commented 3 years ago

Related to https://github.com/archivematica/Issues/issues/1239