cds-snc / platform-forms-client

NextJS application that serves the public-facing website for Forms
https://forms-staging.cdssandbox.xyz/
MIT License

refactor: update confirm api to a server action #4042

Closed bryan-robitaille closed 1 month ago

bryan-robitaille commented 2 months ago

Summary | Résumé

This PR refactors the confirm API into a server action and implements database throttling to handle larger read/write loads.

Instead of implementing exponential backoff and retries manually, this PR uses the built-in retry handling included by default in the AWS SDK. The number of retries for the DynamoDB service has been increased from 3 to 15. In LocalStack, with DynamoDB configured to return throttled-capacity exceptions 50% of the time, we can see the AWS retry logic at work.
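For illustration, a minimal sketch of what raising the SDK retry limit can look like with the AWS SDK v3 client configuration; the module layout, region, and exported names are assumptions, not the PR's actual code:

```ts
// A minimal sketch of leaning on the AWS SDK's built-in retry handling.
// The module layout, region, and exported names are assumptions for
// illustration, not the actual code in this PR.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient } from "@aws-sdk/lib-dynamodb";

// Raise maxAttempts from the SDK default of 3 to 15 so throttled requests are
// retried (with the SDK's own backoff and jitter) before an error is surfaced.
const client = new DynamoDBClient({
  region: "ca-central-1",
  maxAttempts: 15,
});

export const documentClient = DynamoDBDocumentClient.from(client);
```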

As an additional throttling mechanism, a 200ms delay is introduced between batched calls to spread the load more evenly over the partition key (form ID).
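A rough sketch of that batching-with-delay idea; the function name, batch shape, and shared documentClient import are assumptions for illustration:

```ts
// Hypothetical sketch of the 200ms spacing between batched writes. The helper
// name, batch shape, and shared documentClient import are assumptions.
import {
  TransactWriteCommand,
  TransactWriteCommandInput,
} from "@aws-sdk/lib-dynamodb";
import { documentClient } from "./client"; // assumed shared document client

type TransactItems = NonNullable<TransactWriteCommandInput["TransactItems"]>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function confirmInBatches(batches: TransactItems[]) {
  for (const batch of batches) {
    await documentClient.send(new TransactWriteCommand({ TransactItems: batch }));
    // Pause between batches so the load on the partition key (form ID) is
    // spread out rather than hitting DynamoDB all at once.
    await sleep(200);
  }
}
```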

[Screenshots: BatchGetCommand, TransactWriteItems command, and Query command retry behaviour]

Test instructions | Instructions pour tester la modification

Setup:

Test

The actions should not throw errors and should complete, albeit a little more slowly, for the user. If DynamoDB warnings are posted to Slack, they should have no impact on the user.

Unresolved questions / Out of scope | Questions non résolues ou hors sujet

Still need to refactor "report a problem" and "unprocessed submissions" APIs to server actions.

Pull Request Checklist

Please complete the following items in the checklist before you request a review:

github-actions[bot] commented 2 months ago

:test_tube: Review environment

https://uhunrds6otcn2kyt4xt3ws66ha0rimrk.lambda-url.ca-central-1.on.aws/

thiessenp-cds commented 2 months ago

The download works really well!

I did a quick test on staging and then locally. On staging the download is really fast, but it looks like the submissions originally have the downloadedAt var set and then it is nulled soon after. I tried this locally (uploading the 1000 submissions took about 2 hours :)) and saw the same behaviour. I've attached a video that may help explain. I suspect this is related to the TODOs, refactoring "report a problem" and "unprocessed submissions". That could be a follow-up PR?

https://github.com/user-attachments/assets/6d839127-475b-429f-b3a9-339404460b69

wmoussa-gc commented 1 month ago

I feel bad for not knowing why my original fix didn't work. https://github.com/cds-snc/platform-forms-client/pull/3619

craigzour commented 1 month ago

Works pretty well! Good job @bryan-robitaille :)

If we decide to move forward with the increased response limit of 1000, I wonder if we should also be adding a delay in the loops of the following functions: listAllSubmissions, retrieveSubmissions. However, this could also have an impact on the user experience if form owners downloading responses start to refresh the page thinking the process is taking longer than usual (I experienced this while testing this PR). Since we wait for the whole process to complete before returning the responses, if you refresh just before the end you will end up with responses that have been moved to the downloaded tab even though you never actually received them.

Also, I just want to know whether the implementation of getExponentialBackoffTimeInMS in the retrieveSubmissions function is intentional, because from my understanding the DynamoDB SDK would still do it for us in case we get UnprocessedKeys to reprocess.

bryan-robitaille commented 1 month ago

> Works pretty well! Good job @bryan-robitaille :)
>
> If we decide to move forward with the increased response limit of 1000, I wonder if we should also be adding a delay in the loops of the following functions: listAllSubmissions, retrieveSubmissions. However, this could also have an impact on the user experience if form owners downloading responses start to refresh the page thinking the process is taking longer than usual (I experienced this while testing this PR). Since we wait for the whole process to complete before returning the responses, if you refresh just before the end you will end up with responses that have been moved to the downloaded tab even though you never actually received them.
>
> Also, I just want to know whether the implementation of getExponentialBackoffTimeInMS in the retrieveSubmissions function is intentional, because from my understanding the DynamoDB SDK would still do it for us in case we get UnprocessedKeys to reprocess.

@craigzour the 1000 response limit was only for testing that this does increase performance over our current baseline. There is still much work to do before we can set 1000 as the default limit, but I considered that out of scope for the current PR.

The use of exponential backoff in retrieveSubmissions is intentional: 'UnprocessedKeys' are not retried automatically by the SDK because the batch commands do not return a 4xx or 5xx error. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.BatchOperations
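A hedged sketch of the pattern being described, assuming a shared documentClient and a simple doubling backoff helper (the real getExponentialBackoffTimeInMS implementation may differ):

```ts
// Sketch of manually re-requesting UnprocessedKeys with backoff. The table
// name, key shape, backoff helper, and documentClient import are assumptions.
import { BatchGetCommand } from "@aws-sdk/lib-dynamodb";
import type { NativeAttributeValue } from "@aws-sdk/util-dynamodb";
import { documentClient } from "./client"; // assumed shared document client

type Key = Record<string, NativeAttributeValue>;

// Assumed backoff helper: 200ms base, doubling per attempt.
const getExponentialBackoffTimeInMS = (attempt: number) => 200 * 2 ** attempt;
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function batchGetAll(tableName: string, keys: Key[]) {
  const items: Key[] = [];
  let requestKeys = keys;
  let attempt = 0;

  while (requestKeys.length > 0) {
    const { Responses, UnprocessedKeys } = await documentClient.send(
      new BatchGetCommand({
        RequestItems: { [tableName]: { Keys: requestKeys } },
      })
    );
    items.push(...(Responses?.[tableName] ?? []));

    // BatchGetCommand returns UnprocessedKeys with a 200 response, so the
    // SDK's retry logic never kicks in for them; back off and re-request here.
    requestKeys = UnprocessedKeys?.[tableName]?.Keys ?? [];
    if (requestKeys.length > 0) {
      await sleep(getExponentialBackoffTimeInMS(attempt));
      attempt += 1;
    }
  }

  return items;
}
```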

bryan-robitaille commented 1 month ago

> I feel bad for not knowing why my original fix didn't work. #3619

@wmoussa-gc don't feel bad! It did work, but it only covered confirming responses. It was also redundant with the retry logic that TransactWriteItems was already doing behind the scenes, so it was kind of like retry logic on top of retry logic.

bryan-robitaille commented 1 month ago

> Also, I just want to know whether the implementation of getExponentialBackoffTimeInMS in the retrieveSubmissions function is intentional, because from my understanding the DynamoDB SDK would still do it for us in case we get UnprocessedKeys to reprocess.

@craigzour sorry, I misunderstood your comment. I think I understand now what you were referencing. Basically: why do we implement a delay when there are 'UnprocessedKeys', since on the next iteration of the BatchGetCommand in the while loop the AWS SDK will already apply retries and throttling for the unretrieved keys?

So my thought process for introducing the delay is as follows. Context: we pass an array of 50 items to the BatchGetCommand.

Without the delay

  1. The BatchGetCommand fails and retries after 200ms; the second retry succeeds and returns 10 items with 40 in 'UnprocessedKeys'.
  2. The 40 items are passed back in and are immediately retried and fail. Internally the second retry also fails and delays, and the third retry succeeds and returns 10 items with 30 in 'UnprocessedKeys'.

My conclusion was that it was kind of a rapid-fire approach: we knew DynamoDB was busy but we immediately retried anyway. The subsequent invocation of BatchGetCommand runs immediately because the SDK retry mechanism starts fresh every time.

With the delay

  1. The BatchGetCommand fails and retries after 200ms; the second retry succeeds and returns 10 items with 40 in 'UnprocessedKeys'.
  2. We have an indicator that the DB is busy, so we delay 200ms externally before calling the BatchGetCommand again for the same data.
  3. The 40 items are passed back in and are retried and fail. Internally the second retry also fails and delays, and the third retry succeeds and returns 20 items with 20 in 'UnprocessedKeys'.
  4. We have an indicator that the DB is really busy, so we delay 400ms externally before calling the BatchGetCommand again.

With this version I thought we were being proactive: giving DynamoDB a chance to catch up by reducing the load on it and giving it a better probability of completing the request with all items.
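As a rough illustration of the progression described above, assuming getExponentialBackoffTimeInMS doubles a 200ms base per attempt (an assumption about its implementation, not the actual code):

```ts
// Assumed shape of the external backoff between BatchGetCommand calls while
// UnprocessedKeys keep coming back; the real helper may differ.
const getExponentialBackoffTimeInMS = (attempt: number) => 200 * 2 ** attempt;

getExponentialBackoffTimeInMS(0); // 200ms after the first partial batch
getExponentialBackoffTimeInMS(1); // 400ms if the next call is still partial
getExponentialBackoffTimeInMS(2); // 800ms if the table is still busy
```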

craigzour commented 1 month ago

> > Also, I just want to know whether the implementation of getExponentialBackoffTimeInMS in the retrieveSubmissions function is intentional, because from my understanding the DynamoDB SDK would still do it for us in case we get UnprocessedKeys to reprocess.
>
> @craigzour sorry, I misunderstood your comment. I think I understand now what you were referencing. Basically: why do we implement a delay when there are 'UnprocessedKeys', since on the next iteration of the BatchGetCommand in the while loop the AWS SDK will already apply retries and throttling for the unretrieved keys?
>
> So my thought process for introducing the delay is as follows. Context: we pass an array of 50 items to the BatchGetCommand.
>
> Without the delay
>
>   1. The BatchGetCommand fails and retries after 200ms; the second retry succeeds and returns 10 items with 40 in 'UnprocessedKeys'.
>   2. The 40 items are passed back in and are immediately retried and fail. Internally the second retry also fails and delays, and the third retry succeeds and returns 10 items with 30 in 'UnprocessedKeys'.
>
> My conclusion was that it was kind of a rapid-fire approach: we knew DynamoDB was busy but we immediately retried anyway. The subsequent invocation of BatchGetCommand runs immediately because the SDK retry mechanism starts fresh every time.
>
> With the delay
>
>   1. The BatchGetCommand fails and retries after 200ms; the second retry succeeds and returns 10 items with 40 in 'UnprocessedKeys'.
>   2. We have an indicator that the DB is busy, so we delay 200ms externally before calling the BatchGetCommand again for the same data.
>   3. The 40 items are passed back in and are retried and fail. Internally the second retry also fails and delays, and the third retry succeeds and returns 20 items with 20 in 'UnprocessedKeys'.
>   4. We have an indicator that the DB is really busy, so we delay 400ms externally before calling the BatchGetCommand again.
>
> With this version I thought we were being proactive: giving DynamoDB a chance to catch up by reducing the load on it and giving it a better probability of completing the request with all items.

Got it! Then it looks good to me ;)