PgHero cleanup

rmtolmach commented 5 months ago

Issue Description

PgHero (http://pghero-prod.vfs.va.gov/) is reporting some slow queries and, occasionally, a long-running query.

Some of the slowest and/or most frequently called queries should be addressed. We will want to report significant findings to the teams that own the related code.

Tasks

[x] Investigate the yellow alerts in PgHero
[ ] Fix issues or pass along findings to relevant teams
- [ ] Link conversations, tickets, or PRs as comments in this issue.

Success Metrics

The info in PgHero has been analyzed and acted upon.

Acceptance Criteria

[ ] Teams have been notified
[ ] Links to Slack threads or tickets created from this work are added as comments on this issue
[ ] PG Lockout is created for extended run queries

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

RachalCassity commented 4 months ago

Screenshot 2024-05-02 at 10.57.46 AM.png

jennb33 commented 3 months ago

This might involve the VFS teams as well!

rmtolmach commented 3 months ago

Invalid Indexes

These indexes exist, but can’t be used. You should recreate them.

index_flipper_features_on_key added in https://github.com/department-of-veterans-affairs/vets-api/pull/15656

index_flipper_gates_on_feature_key_and_key_and_value also added in https://github.com/department-of-veterans-affairs/vets-api/pull/15656

index_ivc_champva_forms_on_form_uuid added in https://github.com/department-of-veterans-affairs/vets-api/pull/16721

[x] Drop and add in the same migration. In one PR. Something like this:
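```
def change
  # Sketch only: substitute the real table, columns, and original index options (e.g. unique:).
  remove_index :flipper_features, :key, if_exists: true
  add_index :flipper_features, :key
end
```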
1. https://github.com/department-of-veterans-affairs/vets-api/pull/17320 - 👎 ended up having to revert. Trying this one again in https://github.com/department-of-veterans-affairs/vets-api/pull/17331 and https://github.com/department-of-veterans-affairs/vets-api/pull/17337
2. https://github.com/department-of-veterans-affairs/vets-api/pull/17323 - 👎 ended up having to revert

rmtolmach commented 3 months ago

Duplicate Indexes

These indexes exist, but aren’t needed. Remove them for faster writes.

On accreditations
index_accreditations_on_accredited_individual_id (accredited_individual_id) is covered by index_accreditations_on_indi_and_org_ids (accredited_individual_id, accredited_organization_id)
On async_transactions
index_async_transactions_on_transaction_id (transaction_id) is covered by index_async_transactions_on_transaction_id_and_source (transaction_id, source)
On va_notify_in_progress_reminders_sent
index_va_notify_in_progress_reminders_sent_on_user_account_id (user_account_id) is covered by index_in_progress_reminders_sent_user_account_form_id (user_account_id, form_id)

In each case, the covering index is a composite index whose leading column is the same column as the single-column index. Since the composite index can serve the same queries, the single-column index is redundant and can be dropped. First, we can check usage stats to see how often the single-column index is used. If usage is low or zero, we can drop it.
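For reference, the "covered by" relationship is a leading-prefix check on the column lists. Here is a rough sketch of that check in plain Ruby. The `Index` struct and `covered_indexes` helper are made up for illustration and are not PgHero's actual code; the sketch also ignores uniqueness, partial indexes, and index types, which would need to be compared before dropping anything.

```ruby
# Hypothetical helper for illustration only; not PgHero's implementation.
Index = Struct.new(:name, :columns)

# Returns [redundant_name, covering_name] pairs where the redundant index's
# columns are a strict leading prefix of the covering index's columns.
def covered_indexes(indexes)
  indexes.flat_map do |candidate|
    covering = indexes.select do |other|
      other.columns.length > candidate.columns.length &&
        other.columns.first(candidate.columns.length) == candidate.columns
    end
    covering.map { |other| [candidate.name, other.name] }
  end
end

accreditations = [
  Index.new("index_accreditations_on_accredited_individual_id",
            ["accredited_individual_id"]),
  Index.new("index_accreditations_on_indi_and_org_ids",
            ["accredited_individual_id", "accredited_organization_id"])
]

covered_indexes(accreditations).each do |redundant, covering|
  puts "#{redundant} is covered by #{covering}"
end
```

Column order matters here: an index on (a, b) covers an index on (a), but not an index on (b).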

[ ] Create a migration to change the table to drop the index for the dupes.

rmtolmach commented 3 months ago

Suggested Indexes

Add indexes to speed up queries.

CREATE INDEX CONCURRENTLY ON in_progress_forms (form_id, created_at)
CREATE INDEX CONCURRENTLY ON saved_claims (type, id)

We could generate a new migration for these two (or have the owning VFS team do it). Based on the stats in PgHero, we wouldn't save that much time, but maybe it's still worth it?

Edit: I refreshed the PgHero page and this warning was gone. There were no suggested indexes.

rmtolmach commented 3 months ago

Slow Queries

Slow queries take 20 ms or more on average and have been called at least 100 times.
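To make that definition concrete, here is a small sketch of the filter in plain Ruby. The `QueryStat` struct and the sample rows are invented for illustration; only the two thresholds (20 ms average, 100 calls) come from the PgHero text above.

```ruby
# Hypothetical stats rows; field names and numbers are made up for illustration.
QueryStat = Struct.new(:query, :average_ms, :calls)

SLOW_AVERAGE_MS = 20   # average time threshold quoted by PgHero
SLOW_MIN_CALLS  = 100  # minimum call count quoted by PgHero

# A query is flagged only when both conditions hold.
def slow?(stat)
  stat.average_ms >= SLOW_AVERAGE_MS && stat.calls >= SLOW_MIN_CALLS
end

stats = [
  QueryStat.new("query A", 3000.0, 500),   # slow on average and called often
  QueryStat.new("query B", 45.0, 12),      # slow on average, but rarely called
  QueryStat.new("query C", 2.5, 90_000)    # called often, but fast
]

slow = stats.select { |stat| slow?(stat) }
puts slow.map(&:query).inspect
```

A rarely called slow query or a heavily called fast query would not show up in this list, so the list alone does not capture total time spent.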

There are 8 of these currently. The biggest offender is a call to the vba_documents_upload_submissions table which takes an average of 3 seconds! This call takes place in the vba_documents module in upload_remover.rb.

[x] Create a composite (multi-column) index on the four columns that are currently indexed separately. Here's the schema:

create_table "vba_documents_upload_submissions", id: :serial, force: :cascade do |t|
  t.uuid "guid", null: false
  t.string "status", default: "pending", null: false
  t.string "code"
  t.string "detail"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.boolean "s3_deleted"
  t.string "consumer_name"
  t.uuid "consumer_id"
  t.json "uploaded_pdf"
  t.boolean "use_active_storage", default: false
  t.jsonb "metadata", default: {}
  t.index ["created_at"], name: "index_vba_documents_upload_submissions_on_created_at"
  t.index ["guid"], name: "index_vba_documents_upload_submissions_on_guid"
  t.index ["s3_deleted"], name: "index_vba_documents_upload_submissions_on_s3_deleted"
  t.index ["status"], name: "index_vba_documents_upload_submissions_on_status"
end

[ ] Create a composite index on tags, title, and form_name (based on the code in form.rb)

create_table "va_forms_forms", force: :cascade do |t|
  t.string "form_name"
  t.string "url"
  t.string "title"
  t.date "first_issued_on"
  t.date "last_revision_on"
  t.integer "pages"
  t.string "sha256"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.boolean "valid_pdf", default: false
  t.text "form_usage"
  t.text "form_tool_intro"
  t.string "form_tool_url"
  t.string "form_type"
  t.string "language"
  t.datetime "deleted_at"
  t.string "related_forms", array: true
  t.jsonb "benefit_categories"
  t.string "form_details_url"
  t.jsonb "va_form_administration"
  t.integer "row_id"
  t.float "ranking"
  t.string "tags"
  t.date "last_sha256_change"
  t.jsonb "change_history"
  t.index ["valid_pdf"], name: "index_va_forms_forms_on_valid_pdf"
end

Not sure how to test it to make sure it's working. We could test locally. Pair on this task. Before deploy, clear out any long-running queries.

rmtolmach commented 2 months ago

High Number of Connections

1124 connections. Use connection pooling for better performance. PgBouncer is a solid option.

Recently, we went from 6 workers with 3 threads each to 6 workers with 8 threads each (so 48 threads per pod, up from 18). This could have caused the high number of connections, but since this number is similar to the one in the screenshot in the description, I'm guessing the change had no impact and this is just an issue we've had for a long time.

❓ How many free database connections do we have? The number of connections depends on the RDS size. Currently, our RDS can handle 2000, so it's fine.
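A quick back-of-the-envelope version of that math, as a sketch in plain Ruby. It assumes each thread holds at most one database connection at a time, which may not match the real pool configuration; the constant names are just for illustration.

```ruby
# Assumption: each thread holds at most one database connection at a time.
WORKERS_PER_POD     = 6
THREADS_PER_WORKER  = 8
MAX_CONNECTIONS     = 2000  # RDS limit quoted above
CURRENT_CONNECTIONS = 1124  # value reported by PgHero

per_pod     = WORKERS_PER_POD * THREADS_PER_WORKER
headroom    = MAX_CONNECTIONS - CURRENT_CONNECTIONS
utilization = (CURRENT_CONNECTIONS.to_f / MAX_CONNECTIONS * 100).round(1)

puts "Connections per pod (upper bound): #{per_pod}"
puts "Free connections: #{headroom}"
puts "Utilization: #{utilization}%"
```

Under those assumptions there are 876 connections of headroom, so we are a little over half of the limit.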

[x] Fix this monitor: https://vagov.ddog-gov.com/monitors/100559. Its threshold should be 2000, not 2500. According to the AWS docs, the default max_connections value for db.r5.xlarge is 2000.

rmtolmach commented 2 months ago

Long Running Queries

It should be 5 minutes, but it's 21 days.
- Update the statement_timeout in Terraform. Reach out to Kshitiz and Chris to make sure they haven't done this already.
- Some teams run queries from an EC2 instance and leave them open, specifically the forms teams.

rmtolmach commented 2 months ago

End of week update:

While working on the Invalid Indexes, there were some duplicate key errors during the migration. Slack thread. This is still in-flight.
See the comments above for the items left to do. I can do them when I'm back if they're not completed this sprint.

jennb33 commented 2 months ago

This ticket has been duped for Sprint 5 as 87373, at 5 story points, in case the work is not completed in Sprint 4.

rjohnson2011 commented 2 months ago

Merged this PR, which resolves the high-number-of-connections error threshold issue Rebecca noted above: https://github.com/department-of-veterans-affairs/devops/pull/14501

jennb33 commented 2 months ago

Closing this ticket for Sprint 4. Any additional work can be completed in Sprint 5, when @rmtolmach is back in the office. TY @rjohnson2011!

rjohnson2011 commented 2 months ago

Merged and closed this PR, which brings the slow query down by 0.8 ms: https://github.com/department-of-veterans-affairs/vets-api/pull/17377

Logs/Details in the PR

department-of-veterans-affairs / va.gov-team

PgHero cleanup #80648
