Fetch data from the Read RDS instance for auth

nikolai-efimov commented 1 year ago

User Story - Business Need

The database receives a lot of requests, so we need to configure it in a way that reduces the load on write instance(s).

[x] Ticket is understood, and QA has been contacted (if the ticket has a QA label).

User Story(ies)

As a system I want my read requests hit the read db instance So that it reduces the load on the write instance

Additional Info and Resources

Multiple Databases with Binds Configure Flask-SQLAlchemy to use multiple databases

In our case we don't want to bind models (as in linked examples), we just want read from a different engine to avoid having to change a lot of code.
Basically we want to do something like this:

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = <write-db connection string>
app.config['SQLALCHEMY_BINDS'] = {
    'read-db': <read-db connection string>
}
db = SQLAlchemy(app)

class Test(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    desc = db.Column(db.String(80), unique=True, nullable=False)

    def __repr__(self):
        return f"Test #{self.id} '{self.desc}'"

def list_tests_using_read_db():
    # this is what we want to be doing
    reader = db.engines['read-db']
    session = scoped_session(sessionmaker(bind=reader))
    for record in session.query(Test).all():
        print(record)

def list_tests_using_write_db():
    # this is what we do now
    for record in Test.query.all():
        print(record)

Files will need to be changed. There might be some other places that will need to be updated.

app/config.py
app/dao/services_dao.py
cd/application-deployment/dev/vaec-api-task-definition.json
cd/application-deployment/dev/vaec-celery-beat-task-definition.json

Engineering Checklist

[x] SQLALCHEMY_DATABASE_URI should still point to where it's pointing now
[x] add SQLALCHEMY_DATABASE_URI_READ. Coordinate with Infrastructure team. It should have a record for read-db pointing to a read instance. See "Additional Info and Resources"
[x] update app/config.py (add SQLALCHEMY_BINDS)
[x] update app/dao/services_dao.py (dao_fetch_service_by_id_with_api_keys)
[x] Leave all write operations (insert, update, delete) to use the original SQLAlchemy instance
[x] update dao_fetch_service_by_id_with_api_keys to use read-db
[x] check behavior around \@transactional and debug ("None is already present in this session")

Acceptance Criteria

[x] Connections and requests visible in AWS console (such as number of connections and read iops)
[x] Only the dao_fetch_service_by_id_with_api_keys query is changed to utilize the read instance
[x] Regression in Dev throws no errors/failures
[x] Verify that there are no errors like None is already present in this session

QA Considerations

Test to ensure all read and write operations are working as expected

Potential Dependencies

SQLALCHEMY_DATABASE_URI_READ needs to be create in infrastructure

Out of Scope

Identify and modify all places in the code where fetching can be done from a read db instance

mjones-oddball commented 1 year ago

@nikolai-efimov I notice you have this listed as draft. Are you looking to add/adjust the content before we review it? Or is it ready to review? cc @k-macmillan

nikolai-efimov commented 1 year ago

@mjones-oddball , it's ready for a review. Thank you!

mjones-oddball commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @justaskdavidb2 @k-macmillan @kalbfled @ldraney @nikolai-efimov

tabinda-syed commented 1 year ago

Nikolai's recommendation is that we do ticket 1345 before this one.

nikolai-efimov commented 1 year ago

Because we're returning a database model return query.one() instead of actual data in dao_fetch_service_by_id_with_api_keys , we should realize that now that object is going to be bound to READ database instance, which means that we MUST make sure that NO db updates happen using authenticated_service

cris-oddball commented 1 year ago

@nikolai-efimov to duplicate the closed caching ticket, 1345 and clean it up if needed and proceed with opening a PR and testing this ticket.

nikolai-efimov commented 1 year ago

Double call to auth seems to be failing unit test because one of them fails.

This is trace from normal app execution showing two calls to auth per single http request:

INFO 2023-08-01 22:02:24,396 /app/app/authentication/auth.py:182: API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3
2023-08-01T22:02:24 app app INFO None "API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3" [in /app/app/authentication/auth.py:182]

INFO 2023-08-01 22:02:24,401 /app/app/authentication/auth.py:182: API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3
2023-08-01T22:02:24 app app INFO None "API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3" [in /app/app/authentication/auth.py:182]

this seems to be the in app/init.py place where auth calls are being bound to requests

v2_notifications.before_request(validate_service_api_key_auth)
    get_notifications.before_request(validate_service_api_key_auth)
    application.register_blueprint(v2_notifications)

before_request registers a function to be run before each request within a blueprint regardless of the http method. Since both cover /v2/notifications , we could probably remove one.

commit that introduced 2nd binding is https://github.com/department-of-veterans-affairs/notification-api/commit/8050013a061d228bf1890085de6eaf0637e72799

nikolai-efimov commented 1 year ago

ci-test-1  | statement = 'SELECT services.id AS services_id, services.name AS services_name, services.created_at AS services_created_at, servic... services LEFT OUTER JOIN api_keys AS api_keys_1 ON services.id = api_keys_1.service_id \nWHERE services.id = %(id_1)s'
ci-test-1  | parameters = {'id_1': '9adb8533-da7e-4027-9095-62fd1660e838'}
ci-test-1  | context = <sqlalchemy.dialects.postgresql.psycopg2.PGExecutionContext_psycopg2 object at 0xffff89b77130>
ci-test-1  |
ci-test-1  |     def do_execute(self, cursor, statement, parameters, context=None):
ci-test-1  | >       cursor.execute(statement, parameters)
ci-test-1  | E       sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "services" does not exist
ci-test-1  | E       LINE 2: FROM services LEFT OUTER JOIN api_keys AS api_keys_1 ON serv...

data is missing in PostgreSQL - DEV WRITE/Databases/test_notification_api_gw0/Schemas/public/Tables/services table

how are these databases populated?

cris-oddball commented 1 year ago

Branch deployed to Dev. User Flows and the Regression both failed on a status being stuck in created (this is because the Dev queue is overloaded from the testing 3 days ago).

The DB monitors in RDS for the Reader instance are not granular enough to see if the Reader instance is being used. The spike seen in the graphs below were observed to occur right at the end of the deploy. Unclear if User Flows was running. No corresponding spike was seen when the Regression ran.

Screen Shot 2023-08-07 at 10.40.04 AM.jpg

There are no Reader logs for the last 30 minutes since the deploy (during which time the regression ran), so it's hard to tell if the Reader instance is being utilized.

Unclear where to effectively look in Datadog for any of this information.

Nikolai to ask Kyle or Dave to jump onto the Dev DB via the jump host.

nikolai-efimov commented 1 year ago

Screen Shot 2023-08-09 at 3.22.03 PM.png

403 test case is fixed.

re-requesting PR reviews from @k-macmillan and @kalbfled

nikolai-efimov commented 1 year ago

@Cris Stoddard, deployment went through successfully , including user flows. We can start on QA testing whenever you have time for it. Thank you. https://github.com/department-of-veterans-affairs/notification-api/actions/runs/5814934646

cris-oddball commented 1 year ago

Code is merged and deploying to Perf for testing.

cris-oddball commented 1 year ago

After deploying master to Perf, the first regression ran in the cloud had a number of failures:

FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Personalization]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Custom Reply-To on Template]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ VAPROFILEID and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ ICN and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ BIRLSID and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ EDIPI and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send True w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send False w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send True w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send False w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send True w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send False w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email_missing_required_property_template_id
FAILED tests/test_v2_notifications_email.py::test_send_email_missing_required_property_contact_info
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_template_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_format_email_400
FAILED tests/test_v2_notifications_email.py::test_send_email_invalid_type_email_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_recipient_identifier_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_recipient_identifier_properties_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_reference_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_email_reply_to_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_billing_code_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_personalisation_400
FAILED tests/test_v2_notifications_email.py::test_send_email_invalid_template_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_invalid_email_reply_to_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_unexpected_payload_properties_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short] - Assert...
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short Personalized]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ URL]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ optional properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Long 160+] - As...
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ alternate Pinpoint SMS Sender ID]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ BIRLSID recipient ID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ ICN recipient ID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ EDIPI recipient ID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ VAPROFILEID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send True w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send False w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send True w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send False w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send True w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send False w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms_missing_required_property_template_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_missing_required_property_contact_info_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_template_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_format_phone_number_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_phone_number_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_recipient_identifier_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_recipient_identifier_properties_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_sms_sender_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_reference_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_billing_code_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_personalisation_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_invalid_template_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_invalid_sms_sender_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_unexpected_payload_properties_400
============= 58 failed, 72 passed, 4 skipped, 3 xfailed in 38.07s =============
Error: Process completed with exit code 1.

Basically all of the email and sms routes (the GET v2/notifications test did not run for either since no emails or sms were created. They all appeared to fail with 500 Internal Server Error responses.

Testing from local against the public endpoints next.

cris-oddball commented 1 year ago

Exactly the same results when running the regression locally. 500 Internal Server Error.

Sent myself an email via Postman, same result.

{
    "message": "Internal server error",
    "result": "error"
}

Looked briefly in the logs and found a very long stack trace ending in KeyError: 'read-db' Screen Shot 2023-08-10 at 8.02.17 PM.jpg

Datadog shows no hits to the read only instance using the query during the time the regression ran:

env:perf service:postgres operation_name:postgres.query @network.destination.ip:*-ro-*

Screen Shot 2023-08-10 at 8.07.07 PM.jpg

AWS RDS Perf cluster shows no apparent use of the reader during the time frame Screen Shot 2023-08-10 at 8.09.36 PM.jpg

nikolai-efimov commented 1 year ago

Was Perf configured as Dev? Does it have parameter for SQLALCHEMY_DATABASE_URI_READ ?

resource "aws_ssm_parameter" "database_uri_read" {
  name        = "/${var.environment}/notification-api/database/uri_read"
  description = "The READ database URI for ${var.environment}"
  type        = "SecureString"
  value = format(
    "postgresql://%s:%s@%s:%s/%s",
    module.db.this_rds_cluster_master_username,
    module.db.this_rds_cluster_master_password,
    module.db.this_rds_cluster_reader_endpoint,
    module.db.this_rds_cluster_port,
  module.db.this_rds_cluster_database_name)
  key_id = local.notification_kms_key_id
  tags   = local.default_tags
}

cris-oddball commented 1 year ago

@nikolai-efimov Perf does have the following:

All running tasks for the perf-notification-api-service do have the env var SQLALCHEMY_DATABASE_URI_READ
Perf does have a property uri_read in the parameter store

What else needs to be checked here to make sure the code deployed matches the actual infrastructure?

kalbfled commented 1 year ago

This is the traceback:

File \"/usr/local/lib/python3.8/site-packages/flask/app.py\", line 1482, in full_dispatch_request
  rv = self.preprocess_request()
File \"/usr/local/lib/python3.8/site-packages/ddtrace/contrib/flask/helpers.py\", line 28, in wrapper
  return func(pin, wrapped, instance, args, kwargs)
File \"/usr/local/lib/python3.8/site-packages/ddtrace/contrib/flask/patch.py\", line 702, in _traced_request
  return wrapped(*args, **kwargs)
File \"/usr/local/lib/python3.8/site-packages/flask/app.py\", line 1974, in preprocess_request
  rv = self.ensure_sync(before_func)()
File \"/usr/local/lib/python3.8/site-packages/ddtrace/contrib/flask/wrappers.py\", line 76, in trace_func
  return wrapped(*args, **kwargs)
File \"/app/app/authentication/auth.py\", line 153, in validate_service_api_key_auth
  service = dao_fetch_service_by_id_with_api_keys(client)
File \"/app/app/dao/services_dao.py\", line 200, in dao_fetch_service_by_id_with_api_keys
  with get_reader_session() as session:
File \"/usr/local/lib/python3.8/contextlib.py\", line 113, in __enter__
  return next(self.gen)
File \"/app/app/dao/dao_utils.py\", line 101, in get_reader_session
  engine = db.engines['read-db']\nKeyError: 'read-db'

cris-oddball commented 1 year ago

Fix to bind to SQLAlchemy being queued up, will deploy this branch directly to Perf for testing before merging to master.

cris-oddball commented 1 year ago

Deploying Nikolai's new branch to Perf for validation.

cris-oddball commented 1 year ago

Failing on migrations.

re-deployed the commit previous to the 1344 commit to master, migrations succeeded, regression passed
re-deployed the new 1344 commit with the fix for sending messages and migrations failed
re-deployed master (with the 1344 commit that is broken, migrations succeeded (regression will fail).

Issue is related to multiple engines, when tries to talk to RO is failing (?). Which instance is it trying to do migrations on?

cris-oddball commented 1 year ago

The next steps:

code change that should fix the test failures but prevents the migration from running is in app/config.py lines 109-112

# DB conection string
SQLALCHEMY_DATABASE_URI = os.getenv('SQLALCHEMY_DATABASE_URI')
SQLALCHEMY_DATABASE_URI_READ = os.getenv('SQLALCHEMY_DATABASE_URI_READ')
SQLALCHEMY_BINDS = {"read-db": os.getenv("SQLALCHEMY_DATABASE_URI_READ")} # newline

revert master - Dave to do that
assumption: migrate once because master/replica - it's trying to migrate both
confirm which instance the migrations fail on, then we can confirm solutions. (unknown how to confirm) - hack migration and have it echo to stdout what it is doing
nice to have: checkbox for migrations during deployment

cris-oddball commented 1 year ago

ON DEV, the deploy with the fix on 1344v3 succeeds (migration is run). re-deploying the failed 1344v3 to perf with debug logging on.

cris-oddball commented 1 year ago

Reverted the 1344 commit and merged to master.

kalbfled commented 1 year ago

@nikolai-efimov Maybe this is useful: https://stackoverflow.com/questions/63606701/flask-migrate-for-multiple-database

cris-oddball commented 1 year ago

Making a note here about 1344 so we’re on the same page. 1344v2 deploys fine to Dev and Perf, but caused an error in Perf because read-db engine didn’t get attached. 1344v3 (with 1-line fix) deploys fine in Dev, but fails on migrations in Perf. Working theory right now is that v3 might be triggering migrations on read-only instance, which it should not be doing / or something around multiple engines + migrations. The question is what’s different between Dev and Perf. I’m looking into it and will report back with v4 to try

cris-oddball commented 1 year ago

Nikolai pushed an update to the 1344v3 branch to handle perf, staging and prod the same way that dev is handled. This branch also failed to run the migrations.

nikolai-efimov commented 1 year ago

I'll copy things I'm posting in eng chat here so @tabinda-syed and @mjones-oddball are in the loop.

1344v2 deploys fine to Dev and Perf, but caused an error in Perf because read-db engine didn’t get attached. 1344v3 (with 1-line fix) deploys fine in Dev, but fails on migrations in Perf. Working theory right now is that v3 might be triggering migrations on read-only instance, which it should not be doing / or something around multiple engines + migrations. The question is what’s different between Dev and Perf. I’m looking into it and will report back with v4 to try
Most evidence points to multiple engines not being an issue unless we specifically ask to run migrations on all additionally bound databases. In our files we just run CMD ["sh", "-c", "flask db upgrade"] which does not specify any additional dbs to be migrated. I also don't see anything wrong with migrations/env.py
Given that the updated v3 failed on migrations again on Perf and it was ok on Dev, and all configurations seem to be as similar as they can be. Next two suspects are ENV and DATA. Data being more of a suspect because there might be some data that is preventing migrations from happening on Perf. Existing data in the database might affect the migration if there are unique constraints that would be violated by migrations.

Next Action Items

we need to make sure that Perf is also Postgres 15 ... CC @kalbfled , @ldraney
we need to make sure that there is no data in Perf that might be preventing migration (check constraints or remove data) ... CC @cris-oddball

nikolai-efimov commented 1 year ago

Seems like the problem was undefined env variable SQLALCHEMY_DATABASE_URI_READ, which was missing from some of the task definition files. It was working okay in Dev because it was using a localhost default, which is currently unused in celery and would not cause any problems, but which would fail migrations if not specified at all.

cris-oddball commented 1 year ago

Latest branch opened for PR successfully ran the migration when deployed to Perf. Cloud regression for this branch on Perf: 132 passed, 2 skipped, 3 xfailed in 153.51s (0:02:33) Awaiting approvals.

cris-oddball commented 1 year ago

PR approved and merged, sending to Perf to complete testing by checking that the reader instance is actually used as described.

cris-oddball commented 1 year ago

Master successfully deployed to Perf (including migrations). Testing in progress.

cris-oddball commented 1 year ago

Speedy response (typically the first run after a deploy can take longer than usual) 132 passed, 2 skipped, 3 xfailed in 133.66s (0:02:13)

cris-oddball commented 1 year ago

Datadog APM Traces with the following query to isolate the reader instance

env:perf service:postgres operation_name:postgres.query @network.destination.ip:*-ro-*

shows 44 reads in the past 15 minutes, while the regression ran: Screen Shot 2023-08-15 at 8.55.06 AM.jpg

Longest read in P95 was 5 ms (which is terrific), no errors.

Leaving this code to bake for the day, but this looks like a passing ticket so far.

cris-oddball commented 1 year ago

10 second load test run on Perf with Locust

Screenshot showing that the reader instance was used: Screen Shot 2023-08-15 at 10.00.44 AM.jpg

cris-oddball commented 1 year ago

Deploying to Staging to monitor.

cris-oddball commented 1 year ago

Nice graph showing a boost in reads on staging Screen Shot 2023-08-15 at 10.43.12 AM.jpg

Staging Regression: 134 passed, 3 xfailed in 153.85s (0:02:33)

cris-oddball commented 1 year ago

a lovely graph from staging Screen Shot 2023-08-15 at 1.42.05 PM.jpg

department-of-veterans-affairs / notification-api