Consider garbage collection for production database

elrayle commented 3 years ago

Description

Production database is > 2.5GB. Most of that is in the searches table which is > 2GB.

To get the size of the the entire database:

dumped to file system
look at size of file

See the size of each table:

select TABLE_NAME AS `Table`, ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)` from information_schema.TABLES where TABLE_SCHEMA = "_DATABASE_NAME_" order by (DATA_LENGTH + INDEX_LENGTH) desc;

_Replacing _DATABASE_NAME_ with the actual database name._

Number of rows for each table:

select table_name, table_rows from information_schema.tables where table_schema='_DATABASE_NAME_';

Notable sizes...

table	rows	size
searches	6,193,567	2.08M
users	504,550	0.12M
versions	24,991	0.13M

elrayle commented 3 years ago

Searches table

How are they created?

blacklight adds a row in the searches table every time a search is performed
it is safe to delete these and typically are deleted when they are older than 7 days
blacklight has a rake task to do this... delete_old_searches

See also: Issue #408 Searches for deleted exhibits are still in the searches table

ACTION

2021-02-02 - Manually deleted all searches older than 30 days using...

delete from searches where created_at < now() - interval 30 DAY;

BEFORE Delete	AFTER Delete
2.08M	0.18M

elrayle commented 3 years ago

Users table

How are they created?

by signing up - would require manual exploration to determine if it is safe to delete
guest users - can be deleted

Count number of guest users:

select count(guest) from users where guest=1;

Alternate:

select count(id) from users where email like 'guest%';

NOTE: The count should be the same for both approaches.

ACTION

2021-02-02 - Manually deleted all guests using...

delete from users where guest=1;

BEFORE Delete	AFTER Delete
0.12M	128kB

elrayle commented 3 years ago

Versions table

How are they created?

Paper-Trail adds rows to the Versions table. Not sure what triggers a new version.

NEEDS FURTHER INVESTIGATION

elrayle commented 3 years ago

Need to set up nightly crons in production to:

delete all guests older than 2 days
delete all searches older than 7 days

@gdelisle Questions:

How to do this in AWS?
Is there a way to execute the mysql queries or do we need ruby code to execute?

gdelisle commented 3 years ago

The AWS prescription for this involves setting up a Lambda job to do the task. Here is a sort of best practices guide that they have written on the subject: https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/schedule-jobs-for-amazon-rds-and-aurora-postgresql-using-lambda-and-secrets-manager.html

chrisrlc commented 1 week ago

Completed as part of https://culibrary.atlassian.net/browse/LP-222

gdelisle commented 1 week ago

Wait! Can we talk about this a little bit more before it's gone? There is a Lambda job currently running for each tier (three jobs in total) to clear out the database every Wednesday morning. I'll paste the code here, but I'd like to confirm that this new function has replaced the Lambda jobs, and if that's true I can delete the Lambda jobs.

import sys
import logging
import pymysql
import os
#rds settings
rds_host  = os.environ.get('db_host')
name = os.environ.get('db_username')
password = os.environ.get('db_password')
db_name = os.environ.get('db_name')

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        conn = pymysql.connect(host=rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)
    except pymysql.MySQLError as e:
        logger.error("ERROR: Unexpected error: Could not connect to MySQL instance.")
        logger.error(e)
        sys.exit()

logger.info("SUCCESS: Connection to RDS MySQL instance succeeded")
"""
This function fetches content from MySQL RDS instance
"""

item_count = 0

with conn.cursor() as cur:
    cur.execute('delete from searches where created_at < now() - interval 30 DAY;')
    cur.execute('delete from users where guest=1 and created_at < now() - interval 2 DAY;')
    conn.commit()

return "Maintenance done"
logger.info("SUCCESS: Maintenance done")

chrisrlc commented 1 week ago

@gdelisle Oh! I didn't know this was already done as lambdas. That makes sense why I wasn't seeing any guest users in the db then. Yes, feel free to get rid of the lambdas. I'm running the cleanup as rake tasks provided by blacklight, which are running rails ActiveRecord deletes behind the scenes. The rake tasks are going to better be able to handle cleaning up any associations and etc, vs raw sql queries.

I didn't add a cron job for cleaning up the guest users because I saw that there weren't any and just assumed something had changed about how blacklight runs since this ticket was created. So I'll create a separate Jira ticket for handling just that (https://culibrary.atlassian.net/browse/LP-1087). Since we have another spotlight sprint in 2 weeks, it's still okay to delete the lambda job now and that piece of it will get replaced soon.

gdelisle commented 1 week ago

Honestly I almost forgot they were there, until I was reading the comments from 2021 about possibly making Lambdas, and then it jogged my memory because I was the one that wrote them so :B

On Sep 3, 2024, at 12:40 PM, Christina Cortland @.***> wrote:

@gdelislehttps://github.com/gdelisle Oh! I didn't know this was already done as lambdas. That makes sense why I wasn't seeing any guest users in the db then. Yes, feel free to get rid of the lambdas. I'm running the cleanup as rake tasks provided by blacklight, which are running rails ActiveRecord deletes behind the scenes. The rake tasks are going to better be able to handle cleaning up any associations and etc, vs raw sql queries.

I didn't add a cron job for cleaning up the guest users because I saw that there weren't any and just assumed something had changed about how blacklight runs since this ticket was created. So I'll create a separate Jira ticket for handling just that. Since we have another spotlight sprint in 2 weeks, it's still okay to delete the lambda job now and that piece of it will get replaced soon.

— Reply to this email directly, view it on GitHubhttps://github.com/cul-it/exhibits-library-cornell-edu/issues/409#issuecomment-2326978299, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AIROZU5RD7NBWJL4LCH2TYDZUXRBBAVCNFSM6AAAAABNSD3KOWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRWHE3TQMRZHE. You are receiving this because you were mentioned.Message ID: @.***>

gdelisle commented 1 week ago

I have edited the Lambda jobs to "set concurrency to zero" which is supposedly the way to prevent them from executing. Since they're scheduled to execute on Wednesday we should see the results of what would happen when these Lambdas are no longer there.

cul-it / exhibits-library-cornell-edu