galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Hard removal of items in the database (scalability discussion) #13992

Open jdavcs opened 2 years ago

jdavcs commented 2 years ago

The following is a subset of items identified during the scalability discussion at the team meeting in Montpellier (May 16-20, 2022):

hexylena commented 2 years ago

database objects (tables, columns) that consume the most resources

I made this graph a long time ago to answer the same question, hope it helps! https://stats.galaxyproject.eu/d/000000019/galaxy-database?orgId=1&refresh=5m&viewPanel=40

jdavcs commented 2 years ago

I made this graph a long time ago to answer the same question, hope it helps!

Oh yes, absolutely, thanks! I have a spreadsheet with reasonably precise estimates of the number of rows per table on main, but it's from 2020, so I'll update it and then post it here. (While size per table is a more useful metric, a curious detail, at least from 2 years ago, was that 42 tables had zero rows.)

hexylena commented 2 years ago

Really easy to get from the following query if you just need fast estimates:

SELECT reltuples, relname FROM pg_class where relname not like 'pg_%' and relname not like 'ix_%' and relname not like '%_pkey' order by reltuples desc;
   reltuples   |                             relname
---------------+-----------------------------------------------------------------
  5.796195e+08 | job_metric_numeric
 4.5395987e+08 | job_parameter
 3.0803376e+08 | job_state_history
 1.6113411e+08 | history_dataset_association_history
  1.511119e+08 | dataset_permissions
 1.0779887e+08 | history_dataset_association
  9.121957e+07 | job_to_input_dataset
  8.872237e+07 | dataset
  8.311228e+07 | dataset_collection_element
 7.9665984e+07 | job_to_output_dataset
 4.7278964e+07 | job
 4.2060216e+07 | job_to_output_dataset_collection
 4.1388948e+07 | job_metric_text
 3.2805136e+07 | cleanup_event_dataset_association
 3.2322692e+07 | implicit_collection_jobs_job_association
 2.5986542e+07 | galaxy_session
 2.0864564e+07 | cleanup_event_hda_association

~I'll add it to gxadmin since that's a useful query to have.~ Edit: it's there: gxadmin query pg-rows-per-table

$ gxadmin query pg-rows-per-table | head
 table_schema |                           table_name                           |     rows
--------------+----------------------------------------------------------------+---------------
 galaxy       | job_metric_numeric                                             |  5.796195e+08
 galaxy       | job_parameter                                                  | 4.5395987e+08
 galaxy       | job_state_history                                              | 3.0803376e+08
 galaxy       | history_dataset_association_history                            | 1.6113411e+08
 galaxy       | dataset_permissions                                            |  1.511119e+08
 galaxy       | history_dataset_association                                    | 1.0779887e+08
 galaxy       | job_to_input_dataset                                           |  9.121957e+07
 galaxy       | dataset                                                        |  8.872237e+07

Edit 2: 50 empty tables

$ gxadmin query pg-rows-per-table | grep ' 0' | wc -l
50
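
Worth noting that reltuples (and therefore pg-rows-per-table) is a planner estimate refreshed by VACUUM/ANALYZE, not an exact count. As a sketch, the same kind of ranking can be pulled from the statistics collector's live-tuple estimates instead (standard pg_stat_user_tables columns, no Galaxy-specific assumptions):

-- Live/dead row estimates from the statistics collector; updated continuously
-- rather than only at ANALYZE time, but still approximate.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
  FROM pg_stat_user_tables
 ORDER BY n_live_tup DESC
 LIMIT 20;
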
jdavcs commented 2 years ago

Really easy to get from the following query if you just need fast estimates:

Thank you! Very easy indeed. So we have 110(!) empty tables (including a few that are not in the model, but still...)
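
For comparing against the model, a sketch that lists the empty tables by name (same reltuples caveat applies; the galaxy schema name is taken from the gxadmin output above):

-- List tables whose planner estimate says zero rows, to diff against the model.
SELECT n.nspname AS table_schema, c.relname AS table_name
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'r'
   AND n.nspname = 'galaxy'
   AND c.reltuples = 0
 ORDER BY c.relname;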

natefoo commented 2 years ago

And for the record, here's the query I was using in Montpellier to get some of the space usage numbers:

galaxy_main=> SELECT *, pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS index
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS table
  FROM (
  SELECT *, total_bytes-index_bytes-coalesce(toast_bytes,0) AS table_bytes FROM (
      SELECT c.oid,nspname AS table_schema, relname AS table_name
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r'
  ) a
) a order by a.total_bytes desc;

Just total bytes alone would be useful as a gxadmin query to use as a Grafana data source.
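
A minimal sketch of such a query: raw total bytes (table + indexes + TOAST) per table, left unformatted so Grafana can graph it directly (the schema filter is an assumption):

-- Total on-disk size per table, including indexes and TOAST, in raw bytes.
SELECT c.relname AS table_name,
       pg_total_relation_size(c.oid) AS total_bytes
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'r'
   AND n.nspname NOT IN ('pg_catalog', 'information_schema')
 ORDER BY total_bytes DESC;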