Closed nikolai-efimov closed 1 year ago
@nikolai-efimov I notice you have this listed as draft. Are you looking to add/adjust the content before we review it? Or is it ready to review? cc @k-macmillan
@mjones-oddball , it's ready for a review. Thank you!
Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @justaskdavidb2 @k-macmillan @kalbfled @ldraney @nikolai-efimov
Nikolai's recommendation is that we do ticket 1345 before this one.
Because we're returning a database model (return query.one()) instead of plain data in dao_fetch_service_by_id_with_api_keys, that object is going to be bound to the READ database instance, which means that we MUST make sure that NO db updates happen using authenticated_service.
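To make the hazard concrete, here's a stdlib-only illustration (sqlite3 standing in for Postgres, so not the real app code) of what happens when something fetched through a read-only connection is then used for a write:

```python
# Stdlib-only illustration of the hazard called out above: anything fetched
# through a read-only connection must never be used for writes.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.db')

# Seed data through the writer connection.
write_db = sqlite3.connect(path)
write_db.execute('CREATE TABLE services (id INTEGER PRIMARY KEY, name TEXT)')
write_db.execute("INSERT INTO services VALUES (1, 'notify')")
write_db.commit()

# Open a second connection in read-only mode, like a reader replica.
read_db = sqlite3.connect(f'file:{path}?mode=ro', uri=True)
name = read_db.execute('SELECT name FROM services WHERE id = 1').fetchone()[0]

# Attempting an UPDATE through the reader fails, much as Postgres would
# reject a write sent to the read replica mid-request.
try:
    read_db.execute("UPDATE services SET name = 'x' WHERE id = 1")
except sqlite3.OperationalError as exc:
    error = str(exc)

print(name, '|', error)
```

In the real app the failure would surface as Postgres rejecting the UPDATE on the replica, so anything loaded via the reader has to be treated as strictly read-only.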
@nikolai-efimov to duplicate the closed caching ticket (1345), clean it up if needed, and proceed with opening a PR and testing this ticket.
The double call to auth seems to be breaking a unit test: one of the two calls fails.
This is trace from normal app execution showing two calls to auth per single http request:
INFO 2023-08-01 22:02:24,396 /app/app/authentication/auth.py:182: API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3
2023-08-01T22:02:24 app app INFO None "API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3" [in /app/app/authentication/auth.py:182]
INFO 2023-08-01 22:02:24,401 /app/app/authentication/auth.py:182: API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3
2023-08-01T22:02:24 app app INFO None "API authorised for service d6aa2c68-a2d9-4437-ab19-3ae8eb202553 with api key ef12892a-1a02-425d-aa48-05fc345efaf2, using client PostmanRuntime/7.32.3" [in /app/app/authentication/auth.py:182]
This seems to be the place in app/__init__.py where the auth calls are bound to requests:
v2_notifications.before_request(validate_service_api_key_auth)
get_notifications.before_request(validate_service_api_key_auth)
application.register_blueprint(v2_notifications)
before_request registers a function to be run before each request within a blueprint, regardless of the HTTP method. Since both cover /v2/notifications, we could probably remove one of the bindings.
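A minimal mimic of that behavior (plain Python, not real Flask) showing why registering the same hook on two blueprints that cover the same URL space means auth runs twice for a single request:

```python
# Toy mimic of Flask's Blueprint.before_request semantics -- NOT real Flask.
# It assumes one incoming request matches both blueprints' URL space,
# which is the situation described above for /v2/notifications.
calls = []

def validate_service_api_key_auth():
    calls.append('auth')

class Blueprint:
    """Stand-in for flask.Blueprint: just records before_request hooks."""
    def __init__(self, name):
        self.name = name
        self.hooks = []

    def before_request(self, func):
        self.hooks.append(func)

class App:
    """Stand-in app: pretends the request matched every registered blueprint."""
    def __init__(self):
        self.blueprints = []

    def register_blueprint(self, bp):
        self.blueprints.append(bp)

    def dispatch_request(self):
        # Every before_request hook for a matching blueprint fires.
        for bp in self.blueprints:
            for hook in bp.hooks:
                hook()

v2_notifications = Blueprint('v2_notifications')
get_notifications = Blueprint('get_notifications')

v2_notifications.before_request(validate_service_api_key_auth)
get_notifications.before_request(validate_service_api_key_auth)

app = App()
app.register_blueprint(v2_notifications)
app.register_blueprint(get_notifications)

app.dispatch_request()
print(len(calls))  # 2 -- auth ran twice for one request
```

Dropping one of the two before_request registrations would bring this back to a single auth call per request.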
The commit that introduced the 2nd binding is https://github.com/department-of-veterans-affairs/notification-api/commit/8050013a061d228bf1890085de6eaf0637e72799
ci-test-1 | statement = 'SELECT services.id AS services_id, services.name AS services_name, services.created_at AS services_created_at, servic... services LEFT OUTER JOIN api_keys AS api_keys_1 ON services.id = api_keys_1.service_id \nWHERE services.id = %(id_1)s'
ci-test-1 | parameters = {'id_1': '9adb8533-da7e-4027-9095-62fd1660e838'}
ci-test-1 | context = <sqlalchemy.dialects.postgresql.psycopg2.PGExecutionContext_psycopg2 object at 0xffff89b77130>
ci-test-1 |
ci-test-1 | def do_execute(self, cursor, statement, parameters, context=None):
ci-test-1 | > cursor.execute(statement, parameters)
ci-test-1 | E sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "services" does not exist
ci-test-1 | E LINE 2: FROM services LEFT OUTER JOIN api_keys AS api_keys_1 ON serv...
Data is missing in the PostgreSQL - DEV WRITE/Databases/test_notification_api_gw0/Schemas/public/Tables/services table.
How are these databases populated?
Branch deployed to Dev. User Flows and the Regression both failed on a status being stuck in created (this is because the Dev queue is overloaded from the testing 3 days ago).
The DB monitors in RDS for the Reader instance are not granular enough to see if the Reader instance is being used. The spike seen in the graphs below was observed to occur right at the end of the deploy. It's unclear if User Flows was running. No corresponding spike was seen when the Regression ran.
There are no Reader logs for the last 30 minutes since the deploy (during which time the regression ran), so it's hard to tell if the Reader instance is being utilized.
Unclear where to effectively look in Datadog for any of this information.
Nikolai to ask Kyle or Dave to jump onto the Dev DB via the jump host.
403 test case is fixed.
re-requesting PR reviews from @k-macmillan and @kalbfled
@Cris Stoddard, deployment went through successfully, including user flows. We can start on QA testing whenever you have time for it. Thank you. https://github.com/department-of-veterans-affairs/notification-api/actions/runs/5814934646
Code is merged and deploying to Perf for testing.
After deploying master to Perf, the first regression run in the cloud had a number of failures:
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Personalization]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Custom Reply-To on Template]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ VAPROFILEID and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ ICN and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ BIRLSID and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Basic Formatting w/ EDIPI and Optional Properties]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send True w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send False w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send True w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send False w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send True w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email[Email: Default Send False w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_email.py::test_send_email_missing_required_property_template_id
FAILED tests/test_v2_notifications_email.py::test_send_email_missing_required_property_contact_info
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_template_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_format_email_400
FAILED tests/test_v2_notifications_email.py::test_send_email_invalid_type_email_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_recipient_identifier_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_recipient_identifier_properties_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_reference_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_email_reply_to_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_billing_code_400
FAILED tests/test_v2_notifications_email.py::test_send_email_incorrect_type_personalisation_400
FAILED tests/test_v2_notifications_email.py::test_send_email_invalid_template_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_invalid_email_reply_to_id_400
FAILED tests/test_v2_notifications_email.py::test_send_email_unexpected_payload_properties_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short] - Assert...
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short Personalized]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ URL]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ optional properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Long 160+] - As...
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ alternate Pinpoint SMS Sender ID]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ BIRLSID recipient ID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ ICN recipient ID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ EDIPI recipient ID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Short w/ VAPROFILEID and Optional Properties]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send True w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send False w/ VAPROFILEID w/o Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send True w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send False w/ VAPROFILEID w/ Allowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send True w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms[SMS: Default Send False w/ VAPROFILEID w/ Disallowed Comm Items]
FAILED tests/test_v2_notifications_sms.py::test_send_sms_missing_required_property_template_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_missing_required_property_contact_info_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_template_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_format_phone_number_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_phone_number_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_recipient_identifier_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_recipient_identifier_properties_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_sms_sender_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_reference_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_billing_code_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_incorrect_type_personalisation_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_invalid_template_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_invalid_sms_sender_id_400
FAILED tests/test_v2_notifications_sms.py::test_send_sms_unexpected_payload_properties_400
============= 58 failed, 72 passed, 4 skipped, 3 xfailed in 38.07s =============
Error: Process completed with exit code 1.
Basically all of the email and SMS routes failed (the GET /v2/notifications tests did not run for either, since no emails or SMS were created). They all appeared to fail with 500 Internal Server Error responses.
Testing from local against the public endpoints next.
Exactly the same results when running the regression locally. 500 Internal Server Error.
Sent myself an email via Postman, same result.
{
"message": "Internal server error",
"result": "error"
}
Looked briefly in the logs and found a very long stack trace ending in KeyError: 'read-db'
Datadog shows no hits to the read-only instance during the time the regression ran, using the query:
env:perf service:postgres operation_name:postgres.query @network.destination.ip:*-ro-*
AWS RDS Perf cluster shows no apparent use of the reader during the time frame
Was Perf configured like Dev? Does it have a parameter for SQLALCHEMY_DATABASE_URI_READ?
resource "aws_ssm_parameter" "database_uri_read" {
name = "/${var.environment}/notification-api/database/uri_read"
description = "The READ database URI for ${var.environment}"
type = "SecureString"
value = format(
"postgresql://%s:%s@%s:%s/%s",
module.db.this_rds_cluster_master_username,
module.db.this_rds_cluster_master_password,
module.db.this_rds_cluster_reader_endpoint,
module.db.this_rds_cluster_port,
module.db.this_rds_cluster_database_name)
key_id = local.notification_kms_key_id
tags = local.default_tags
}
@nikolai-efimov Perf does have the following:
SQLALCHEMY_DATABASE_URI_READ
uri_read
in the parameter store.
What else needs to be checked here to make sure the code deployed matches the actual infrastructure?
This is the traceback:
File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1482, in full_dispatch_request
  rv = self.preprocess_request()
File "/usr/local/lib/python3.8/site-packages/ddtrace/contrib/flask/helpers.py", line 28, in wrapper
  return func(pin, wrapped, instance, args, kwargs)
File "/usr/local/lib/python3.8/site-packages/ddtrace/contrib/flask/patch.py", line 702, in _traced_request
  return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1974, in preprocess_request
  rv = self.ensure_sync(before_func)()
File "/usr/local/lib/python3.8/site-packages/ddtrace/contrib/flask/wrappers.py", line 76, in trace_func
  return wrapped(*args, **kwargs)
File "/app/app/authentication/auth.py", line 153, in validate_service_api_key_auth
  service = dao_fetch_service_by_id_with_api_keys(client)
File "/app/app/dao/services_dao.py", line 200, in dao_fetch_service_by_id_with_api_keys
  with get_reader_session() as session:
File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
  return next(self.gen)
File "/app/app/dao/dao_utils.py", line 101, in get_reader_session
  engine = db.engines['read-db']
KeyError: 'read-db'
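A sketch of why that last line raises. FakeDB here is a toy stand-in for the flask_sqlalchemy object, not the real one: in Flask-SQLAlchemy, db.engines is keyed by bind name, so a bind that was never configured in SQLALCHEMY_BINDS simply isn't in the mapping.

```python
# Toy reproduction of the KeyError: a reader-session helper looking up a
# bind that was never configured. FakeDB is NOT flask_sqlalchemy; it only
# mimics the engines mapping.
from contextlib import contextmanager

class FakeDB:
    """Stand-in for the flask_sqlalchemy.SQLAlchemy object."""
    def __init__(self, binds):
        # Key None is the default engine (SQLALCHEMY_DATABASE_URI);
        # named keys come from SQLALCHEMY_BINDS.
        self.engines = {None: 'write-engine', **binds}

@contextmanager
def get_reader_session(db):
    engine = db.engines['read-db']  # KeyError if the bind was never configured
    yield f'session-on-{engine}'

# Without the bind (what Perf looked like): KeyError, i.e. a 500 per request.
db_without_bind = FakeDB(binds={})
try:
    with get_reader_session(db_without_bind):
        pass
except KeyError as exc:
    print('missing bind:', exc)

# With the bind configured, the reader session resolves normally.
db_with_bind = FakeDB(binds={'read-db': 'read-engine'})
with get_reader_session(db_with_bind) as session:
    print(session)  # session-on-read-engine
```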
The fix to add the bind to SQLAlchemy is being queued up; will deploy this branch directly to Perf for testing before merging to master.
Deploying Nikolai's new branch to Perf for validation.
Failing on migrations.
The issue is related to multiple engines; it seems to fail when it tries to talk to the RO instance (?). Which instance is it trying to run the migrations on?
The next steps:
# DB connection string
SQLALCHEMY_DATABASE_URI = os.getenv('SQLALCHEMY_DATABASE_URI')
SQLALCHEMY_DATABASE_URI_READ = os.getenv('SQLALCHEMY_DATABASE_URI_READ')
SQLALCHEMY_BINDS = {"read-db": os.getenv("SQLALCHEMY_DATABASE_URI_READ")}  # new line
On Dev, the deploy with the fix on 1344v3 succeeds (migration is run). Re-deploying the failed 1344v3 to Perf with debug logging on.
Reverted the 1344 commit and merged to master.
@nikolai-efimov Maybe this is useful: https://stackoverflow.com/questions/63606701/flask-migrate-for-multiple-database
Making a note here about 1344 so we're on the same page. 1344v2 deploys fine to Dev and Perf, but caused an error in Perf because the read-db engine didn't get attached. 1344v3 (with a 1-line fix) deploys fine in Dev, but fails on migrations in Perf. The working theory right now is that v3 might be triggering migrations on the read-only instance, which it should not be doing, or something around multiple engines + migrations. The question is what's different between Dev and Perf. I'm looking into it and will report back with a v4 to try.
Nikolai pushed an update to the 1344v3 branch to handle perf, staging and prod the same way that dev is handled. This branch also failed to run the migrations.
I'll copy things I'm posting in eng chat here so @tabinda-syed and @mjones-oddball are in the loop.
1344v2 deploys fine to Dev and Perf, but caused an error in Perf because the read-db engine didn't get attached. 1344v3 (with a 1-line fix) deploys fine in Dev, but fails on migrations in Perf. The working theory right now is that v3 might be triggering migrations on the read-only instance, which it should not be doing, or something around multiple engines + migrations. The question is what's different between Dev and Perf. I'm looking into it and will report back with a v4 to try.
Most evidence points to multiple engines not being an issue unless we specifically ask to run migrations on all additionally bound databases. In our files we just run CMD ["sh", "-c", "flask db upgrade"]
which does not specify any additional dbs to be migrated. I also don't see anything wrong with migrations/env.py
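For context on why only the default DB gets migrated, my understanding of the Flask-Migrate defaults (not verified against this repo's setup):

```shell
# Illustrative only. With the default single-database template:
flask db upgrade        # runs migrations against SQLALCHEMY_DATABASE_URI only
# Binds such as "read-db" would only be migrated if the project had been
# initialized with the multidb template:
flask db init --multidb # (we did not, so binds are left alone)
```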
The updated v3 failed on migrations again on Perf while it was OK on Dev, and all configurations seem to be as similar as they can be. The next two suspects are ENV and DATA. Data is the stronger suspect because there might be some data preventing migrations from happening on Perf: existing data in the database might affect the migration if there are unique constraints that would be violated by it.
Next Action Items
Seems like the problem was the undefined env variable SQLALCHEMY_DATABASE_URI_READ, which was missing from some of the task definition files. It was working okay in Dev because Dev was using a localhost default, which is currently unused in celery and would not cause any problems there, but which would fail migrations if not specified at all.
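A sketch of the suspected failure mode, and a fail-fast alternative (the localhost URI below is a hypothetical placeholder for illustration, not our actual default):

```python
# Why a localhost fallback hides a missing env var in one environment but
# breaks another: the default masks the misconfiguration until something
# (like migrations) actually uses the URI.
import os

def read_db_uri():
    # Silent fallback: a missing task-definition variable goes unnoticed.
    return os.getenv('SQLALCHEMY_DATABASE_URI_READ',
                     'postgresql://localhost/notification_api')

def read_db_uri_strict():
    # Failing fast at boot makes the missing variable obvious immediately.
    uri = os.getenv('SQLALCHEMY_DATABASE_URI_READ')
    if uri is None:
        raise RuntimeError('SQLALCHEMY_DATABASE_URI_READ is not set')
    return uri

os.environ.pop('SQLALCHEMY_DATABASE_URI_READ', None)
print(read_db_uri())  # falls back silently
try:
    read_db_uri_strict()
except RuntimeError as exc:
    print('strict:', exc)
```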
The latest branch opened for PR successfully ran the migration when deployed to Perf. Cloud regression for this branch on Perf: 132 passed, 2 skipped, 3 xfailed in 153.51s (0:02:33). Awaiting approvals.
PR approved and merged, sending to Perf to complete testing by checking that the reader instance is actually used as described.
Master successfully deployed to Perf (including migrations). Testing in progress.
Speedy response (typically the first run after a deploy takes longer than usual): 132 passed, 2 skipped, 3 xfailed in 133.66s (0:02:13)
Datadog APM Traces with the following query to isolate the reader instance
env:perf service:postgres operation_name:postgres.query @network.destination.ip:*-ro-*
shows 44 reads in the past 15 minutes, while the regression ran:
Longest read in P95 was 5 ms (which is terrific), no errors.
Leaving this code to bake for the day, but this looks like a passing ticket so far.
10 second load test run on Perf with Locust
Screenshot showing that the reader instance was used:
Deploying to Staging to monitor.
Nice graph showing a boost in reads on staging
Staging Regression: 134 passed, 3 xfailed in 153.85s (0:02:33)
a lovely graph from staging
User Story - Business Need
The database receives a lot of requests, so we need to configure it in a way that reduces the load on the write instance(s).
User Story(ies)
As a system, I want my read requests to hit the read DB instance, so that the load on the write instance is reduced.
Additional Info and Resources
- Multiple Databases with Binds
- Configure Flask-SQLAlchemy to use multiple databases
In our case we don't want to bind models (as in the linked examples); we just want reads to go through a different engine, to avoid having to change a lot of code.
Basically we want to do something like this:
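(The original snippet appears to have been an attachment. The non-runnable pseudocode sketch below reconstructs the intended shape from the config lines and traceback in this thread; the file placement and session handling are assumptions.)

```python
# Non-runnable sketch; helper/file names taken from this ticket, wiring assumed.

# config.py: register a named bind for the reader endpoint
SQLALCHEMY_DATABASE_URI = os.getenv('SQLALCHEMY_DATABASE_URI')
SQLALCHEMY_DATABASE_URI_READ = os.getenv('SQLALCHEMY_DATABASE_URI_READ')
SQLALCHEMY_BINDS = {'read-db': SQLALCHEMY_DATABASE_URI_READ}

# app/dao/dao_utils.py: a session pinned to the read engine
@contextmanager
def get_reader_session():
    engine = db.engines['read-db']
    with Session(engine) as session:
        yield session

# app/dao/services_dao.py: route this read through the reader
def dao_fetch_service_by_id_with_api_keys(service_id):
    with get_reader_session() as session:
        return session.query(Service).filter(Service.id == service_id).one()
```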
Some files will need to be changed, and there might be some other places that will need to be updated.
Engineering Checklist
Acceptance Criteria
dao_fetch_service_by_id_with_api_keys
query is changed to utilize the read instance
QA Considerations
Test to ensure all read and write operations are working as expected
Potential Dependencies
Out of Scope
Identify and modify all places in the code where fetching can be done from a read db instance