AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Look into autovacuum's behavior for downloader_jobs and original_files tables #2874

Open kurtwheeler opened 3 years ago

kurtwheeler commented 3 years ago

Context

In https://github.com/AlexsLemonade/refinebio/pull/2867 @wvauclain wrote:

A last thing that we could try is to manually vacuum some of the worst tables, like the processor jobs, downloader jobs, and samples. I have a feeling that these tables are pretty fragmented because we delete things from them so often when unsurveying experiments. A full vacuum requires writing out a new database file and copying everything over while the database is fully locked, though, so I am not sure about the RDS costs associated with that or what the downtime would be.
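
A quick way to get a sense of how fragmented a table is, before deciding whether a full vacuum is worth the lock, is to look at its on-disk size next to its dead-tuple count. A minimal sketch, assuming the table is actually named downloader_jobs (the real schema name may differ):

-- Rough bloat proxy: total on-disk size plus live/dead tuple counts.
SELECT pg_size_pretty(pg_total_relation_size('downloader_jobs')) AS total_size,
       n_live_tup,
       n_dead_tup
FROM pg_stat_all_tables
WHERE relname = 'downloader_jobs';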

Problem or idea

We have autovacuum on by default and it is running for some tables. It doesn't seem to have triggered for the original_files or downloader_jobs tables, though. I'm not sure if this is expected or not. I started looking into it using this query:

SELECT * FROM pg_stat_all_tables WHERE relname NOT LIKE 'pg_%' ORDER BY last_autovacuum DESC;

but I'm not sure what most of the returned columns mean, only that last_autovacuum being NULL must mean that a table has never been autovacuumed.
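
For reference, a narrower query over the same view is easier to read; the columns that matter most for autovacuum are n_dead_tup (dead rows awaiting cleanup), n_live_tup, and the last_autovacuum / autovacuum_count pair:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_all_tables
WHERE relname NOT LIKE 'pg_%'
ORDER BY n_dead_tup DESC;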

Solution or next step

Look into https://aws.amazon.com/blogs/database/understanding-autovacuum-in-amazon-rds-for-postgresql-environments/, https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Autovacuum.html#Appendix.PostgreSQL.CommonDBATasks.Autovacuum.NeedVacuuming, and the results of that query to see if there's something we can do to improve the query performance of those tables, perhaps by getting autovacuum to trigger for them.
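
Per those docs, autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * n_live_tup (defaults 50 and 0.2). A rough sketch of a query to see how far each table is from that trigger point, assuming no per-table storage parameters override the cluster defaults (the table names follow the issue title and may not match the actual Django table names):

SELECT relname,
       n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::numeric * n_live_tup
         AS autovacuum_trigger_point
FROM pg_stat_all_tables
WHERE relname IN ('downloader_jobs', 'original_files')
ORDER BY n_dead_tup DESC;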

wvauclain commented 3 years ago

I just manually vacuumed the processor job and downloader job tables using VACUUM ANALYZE [table], and anecdotally the queries do seem to be running somewhat faster. It looks like this is definitely something we need to look into.
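
For the record, the commands were roughly the following (VERBOSE is optional but reports what was reclaimed; the table names here are assumptions and may differ from the real schema names):

VACUUM (VERBOSE, ANALYZE) downloader_jobs;
VACUUM (VERBOSE, ANALYZE) processor_jobs;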

davidsmejia commented 3 years ago

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Autovacuum.html#Appendix.PostgreSQL.CommonDBATasks.Autovacuum.WorkMemory

I think we need to add:

# Temporary. This will go away once we can delete things.
resource "aws_db_parameter_group" "postgres_parameter_group" {
  name = "postgres-parameter-group-${var.user}-${var.stage}"
  description = "Postgres Parameters ${var.user} ${var.stage}"
  family = "postgres11"

  parameter {
    name = "deadlock_timeout"
    value = "60000" # 60000ms = 60s
  }

  parameter {
    name = "statement_timeout"
    value = "60000" # 60000ms = 60s
  }

  # autovacuum will not work by default
  parameter {
    name  = "autovacuum"
    value = "1"
  }

  tags = var.default_tags
}

But I still need to find out how to enable track_counts.

As per the docs:

autovacuum (boolean) Controls whether the server should run the autovacuum launcher daemon. This is on by default; however, track_counts must also be enabled for autovacuum to work. This parameter can only be set in the postgresql.conf file or on the server command line; however, autovacuuming can be disabled for individual tables by changing table storage parameters.

kurtwheeler commented 3 years ago

I don't think we need to add that. As I mentioned before, autovacuum is indeed running; it's just not running for all tables. We probably need to tweak thresholds. I think https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Autovacuum.html#Appendix.PostgreSQL.CommonDBATasks.Autovacuum.NeedVacuuming covers how those work.

data_refinery=> select name, setting from pg_settings where name in ('autovacuum', 'track_counts');
     name     | setting 
--------------+---------
 autovacuum   | on
 track_counts | on
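
If the default thresholds turn out to be too coarse for the big, delete-heavy tables, autovacuum can also be tuned per table via storage parameters. A sketch, not a recommendation; the values and table name are placeholders:

ALTER TABLE downloader_jobs
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_threshold = 1000);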