cul-it / exhibits-library-cornell-edu

Spotlight
3 stars 2 forks source link

Consider garbage collection for production database #409

Open elrayle opened 3 years ago

elrayle commented 3 years ago

Description

Production database is > 2.5GB. Most of that is in the searches table which is > 2GB.

To get the size of the the entire database:

See the size of each table:

select TABLE_NAME AS `Table`, ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)` from information_schema.TABLES where TABLE_SCHEMA = "_DATABASE_NAME_" order by (DATA_LENGTH + INDEX_LENGTH) desc;

_Replacing _DATABASE_NAME_ with the actual database name._

Number of rows for each table:

select table_name, table_rows from information_schema.tables where table_schema='_DATABASE_NAME_';

Notable sizes...

table rows size
searches 6,193,567 2.08M
users 504,550 0.12M
versions 24,991 0.13M
elrayle commented 3 years ago

Searches table

How are they created?

See also: Issue #408 Searches for deleted exhibits are still in the searches table

ACTION

2021-02-02 - Manually deleted all searches older than 30 days using...

delete from searches where created_at < now() - interval 30 DAY;
BEFORE Delete AFTER Delete
2.08M 0.18M
elrayle commented 3 years ago

Users table

How are they created?

Count number of guest users:

select count(guest) from users where guest=1;

Alternate:

select count(id) from users where email like 'guest%';

NOTE: The count should be the same for both approaches.

ACTION

2021-02-02 - Manually deleted all guests using...

delete from users where guest=1;
BEFORE Delete AFTER Delete
0.12M 128kB
elrayle commented 3 years ago

Versions table

How are they created?

NEEDS FURTHER INVESTIGATION

elrayle commented 3 years ago

Need to set up nightly crons in production to:

@gdelisle Questions:

gdelisle commented 3 years ago

The AWS prescription for this involves setting up a Lambda job to do the task. Here is a sort of best practices guide that they have written on the subject: https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/schedule-jobs-for-amazon-rds-and-aurora-postgresql-using-lambda-and-secrets-manager.html

chrisrlc commented 1 week ago

Completed as part of https://culibrary.atlassian.net/browse/LP-222

gdelisle commented 1 week ago

Wait! Can we talk about this a little bit more before it's gone? There is a Lambda job currently running for each tier (three jobs in total) to clear out the database every Wednesday morning. I'll paste the code here, but I'd like to confirm that this new function has replaced the Lambda jobs, and if that's true I can delete the Lambda jobs.

import sys
import logging
import pymysql
import os
#rds settings
rds_host  = os.environ.get('db_host')
name = os.environ.get('db_username')
password = os.environ.get('db_password')
db_name = os.environ.get('db_name')

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        conn = pymysql.connect(host=rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)
    except pymysql.MySQLError as e:
        logger.error("ERROR: Unexpected error: Could not connect to MySQL instance.")
        logger.error(e)
        sys.exit()
logger.info("SUCCESS: Connection to RDS MySQL instance succeeded")
"""
This function fetches content from MySQL RDS instance
"""

item_count = 0

with conn.cursor() as cur:
    cur.execute('delete from searches where created_at < now() - interval 30 DAY;')
    cur.execute('delete from users where guest=1 and created_at < now() - interval 2 DAY;')
    conn.commit()

return "Maintenance done"
logger.info("SUCCESS: Maintenance done")
chrisrlc commented 1 week ago

@gdelisle Oh! I didn't know this was already done as lambdas. That makes sense why I wasn't seeing any guest users in the db then. Yes, feel free to get rid of the lambdas. I'm running the cleanup as rake tasks provided by blacklight, which are running rails ActiveRecord deletes behind the scenes. The rake tasks are going to better be able to handle cleaning up any associations and etc, vs raw sql queries.

I didn't add a cron job for cleaning up the guest users because I saw that there weren't any and just assumed something had changed about how blacklight runs since this ticket was created. So I'll create a separate Jira ticket for handling just that (https://culibrary.atlassian.net/browse/LP-1087). Since we have another spotlight sprint in 2 weeks, it's still okay to delete the lambda job now and that piece of it will get replaced soon.

gdelisle commented 1 week ago

Honestly I almost forgot they were there, until I was reading the comments from 2021 about possibly making Lambdas, and then it jogged my memory because I was the one that wrote them so :B

On Sep 3, 2024, at 12:40 PM, Christina Cortland @.***> wrote:

@gdelislehttps://github.com/gdelisle Oh! I didn't know this was already done as lambdas. That makes sense why I wasn't seeing any guest users in the db then. Yes, feel free to get rid of the lambdas. I'm running the cleanup as rake tasks provided by blacklight, which are running rails ActiveRecord deletes behind the scenes. The rake tasks are going to better be able to handle cleaning up any associations and etc, vs raw sql queries.

I didn't add a cron job for cleaning up the guest users because I saw that there weren't any and just assumed something had changed about how blacklight runs since this ticket was created. So I'll create a separate Jira ticket for handling just that. Since we have another spotlight sprint in 2 weeks, it's still okay to delete the lambda job now and that piece of it will get replaced soon.

— Reply to this email directly, view it on GitHubhttps://github.com/cul-it/exhibits-library-cornell-edu/issues/409#issuecomment-2326978299, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AIROZU5RD7NBWJL4LCH2TYDZUXRBBAVCNFSM6AAAAABNSD3KOWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRWHE3TQMRZHE. You are receiving this because you were mentioned.Message ID: @.***>

gdelisle commented 1 week ago

I have edited the Lambda jobs to "set concurrency to zero" which is supposedly the way to prevent them from executing. Since they're scheduled to execute on Wednesday we should see the results of what would happen when these Lambdas are no longer there.