Closed ArwenQin closed 1 day ago
During testing, we found an intermittent issue where PDF attachments were missing from incorporation emails. The issue occurs in approximately 2 out of 10 incorporation filings.
After investigating the logs, we found that these missing attachments are caused by 503 errors from the Report API service.
After reaching out to Andriy, we confirmed several important points:
The Report API service in the dev/test environments was running with a 512 MiB memory allocation, which was not enough for the container and caused the 503 errors. A typical incorporation certificate PDF is around 644 KB.
The production environment has 8GB memory allocation, so this issue may not happen in the production environment.
The REPORT API moved to GCP a few months ago, and the URL for dev is: https://report-api-dev-366678529892.northamerica-northeast1.run.app. While the old service in OpenShift is still running, the SRE team recommends that we transition to using the new GCP endpoint.
After Andriy bumped up the memory from 512 MiB to 2 GB in both the dev and test environments, I ran about 30 test requests against the Report API endpoint. No 503 errors were encountered during this testing, so the memory increase seems to have resolved the issue.
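For anyone who wants to repeat this kind of check, here is a minimal sketch of a loop that tallies status codes over a batch of requests. It is not the script used above. `http_status`, `count_statuses`, and `chance_of_clean_run` are hypothetical names, and the plain GET in `http_status` is a placeholder: a real Report API call would need the proper request method, payload, and auth, none of which are shown in this thread.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code of a GET to `url` (placeholder request)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want to tally.
        return err.code


def count_statuses(send: Callable[[], int], attempts: int) -> Counter:
    """Call `send` `attempts` times and tally the status codes it returns."""
    return Counter(send() for _ in range(attempts))


def chance_of_clean_run(failure_rate: float, attempts: int) -> float:
    """Probability that `attempts` independent requests all succeed."""
    return (1.0 - failure_rate) ** attempts


if __name__ == "__main__":
    # Fake sender so the sketch runs without network access or credentials.
    fake_codes = iter([200, 200, 503, 200, 200])
    tally = count_statuses(lambda: next(fake_codes), attempts=5)
    print("503 count:", tally[503])

    # If requests still failed independently at roughly 2 in 10,
    # how likely would 30 clean requests in a row be?
    print("chance of 30 clean requests:", chance_of_clean_run(0.2, 30))
```

Under the assumption that failures are independent and would still occur at the earlier rate of roughly 2 in 10, a run of 30 clean requests would happen only about 0.1% of the time, which supports the conclusion that the memory increase helped. The 2-in-10 figure was observed per filing rather than per API request, so this is a rough sanity check only.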
Configuration Updates: We need to confirm if we should update the 1password secrets in DEV and TEST to use the new GCP REPORT API URL.
The intermittent missing attachments issue could be due to limited resources allocated to DEV and TEST. Should we solve it by increasing the memory or should we maintain the current configuration (512 MiB) in TEST/DEV to keep costs down? As more memory means more costs, maybe we should focus on the PROD env.
Since PROD already has 8 GB, it might not be affected, but we can't be 100% sure. Perhaps we should monitor it and create a post-launch monitoring ticket to see if we encounter this issue in production.
@leodube-aot @vysakh-menon-aot @severinbeauvais What do you all think about these findings? We'd appreciate hearing your thoughts and suggestions on how we should address these issues.
This is the same root cause as the intermittently missing PDFs for voluntary dissolution documents (rare, but it has happened): #24318
And I also have 2 questions regarding the next steps:
Same here, would like to know your thoughts 😃
@AimeeGao Update to use the Report API in GCP. While updating, please make sure that there are no auth issues.
Update both 1Password and the OpenShift secret (updating the secret from 1Password is disabled in OpenShift; ref: #23222).
Fantastic investigation, Aimee! 👍
Thanks for your response, Vysakh.
Thanks, Vysakh and Sev! Good points. For the first question, do you think it would be helpful if we post a message in the channel to let everyone know that we're planning to switch to the GCP Report API? As for the second question, maybe I can check with Andriy since Patrick is on vacation. Does that sound good?
For the first question, I think specific people have to be notified directly. Do you know who owns Report API?
For the second question, yes, sure.
I’m not entirely sure who the owner of the Report API is. However, I know that the code comes from this repo: https://github.com/bcgov/bcros-common/tree/main/report-api. Maybe we could check with someone familiar with this repository to confirm ownership?
The last question is whether we should increase the resource allocation for the DEV/TEST environment. Given that Andriy mentioned that DEV/TEST may not require a lot of resources, and given the high cost, do we see a need to increase the memory? Or should we keep the current 512 MiB configuration for now? ( This question is from Andriy )
These questions need to be answered by the project owners and whoever pays for the services. I thought that was Patrick, but you could also escalate through your PO.
@seeker25 Thoughts?
I got some feedback from Andriy regarding the OCP Report API. He mentioned that there is still traffic going through it, so shutting it down early would have an impact.
OK, let's leave this with Andriy for now. And please tag @pwei1018 as needed.
Also, let's update the keys so that Dev and Test use the GCP instance of Report API.
Everything else is above my pay grade, so cc: @davemck513 @OlgaPotiagalova
@severinbeauvais It shouldn't be an issue to raise the resources. I was talking to Patrick about this before he left. We pay way more for SQL and storage than for Cloud Run services, at least for auth.
Thanks, Travis.
But soon we'll want to use the Report API in GCP. Is it stable? Should we just change the 1Password keys? Or should we first find out who would be affected by changing the keys and then make sure they're ready for the change?
I think it is fairly stable, we're using it for receipt generation. Probably not hard to switch back if necessary? Patrick is back off vacation on the 18th/19th? Could always just wait for then?
Thanks again, Travis. Yes, just a couple of keys to change + redeploy.
@AimeeGao , it sounds like it would be OK to change it for Dev and Test right now and then park this ticket (or create a duplicate) for changing Prod later.
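To illustrate why the switch is only "a couple of keys to change + redeploy", here is a minimal sketch of a service reading the Report API base URL from its environment. The variable name `REPORT_API_URL` and the function `report_api_url` are assumptions made for illustration; the actual key names in 1Password and the OpenShift secret are not given in this thread.

```python
import os
from typing import Mapping


def report_api_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the Report API base URL from configuration.

    REPORT_API_URL is a hypothetical variable name; substitute whatever key
    the deployment actually injects from its secret.
    """
    url = env.get("REPORT_API_URL", "").strip()
    if not url:
        # Fail loudly rather than silently calling a stale endpoint.
        raise RuntimeError("REPORT_API_URL is not set")
    # Drop any trailing slash so paths can be appended consistently.
    return url.rstrip("/")


if __name__ == "__main__":
    # Dev URL quoted earlier in this thread, passed in as if set by the secret.
    dev_env = {
        "REPORT_API_URL": "https://report-api-dev-366678529892.northamerica-northeast1.run.app/"
    }
    print(report_api_url(dev_env))
```

With the URL held in configuration like this, pointing Dev and Test at GCP while Prod stays on OCP is a per-environment secret change, and switching back is the same change in reverse.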
Thanks, Sev and Travis. I also got feedback from Dave (@davemck513 ), which aligns well with your suggestions.
So, we'll proceed as follows:
- Change the 1pass. Sev, could you help me update the 1Password config, as we discussed this morning?
It's changed for Dev only. Is there any way you can test this before I change Test?
PS - What's the new URL for Test?
Thanks for the update. I've also updated the OCP Secret in Dev. I'm currently testing the changes by calling the API to verify if there are any issues with the Dev changes.
As for the Test URL, I'm still in the process of confirming it. I'll update you as soon as I have that information.
We’ve confirmed that the latest GCP configuration is in place for both Dev and Test environments, and no other changes were needed. After scaling up resources, I ran tests in both environments:
Everything looks good, no 503 errors came up during testing, and everything seems to be running smoothly.
Does anything need to be done for Prod right now?
Does anything need to be done for Prod later? (When?)
Thanks, Sev for pointing this out 👍
Our current Prod configuration is still using the old OCP setup. Here's what we’ll need to do:
It might be a good idea to make these changes after we've confirmed everything is working smoothly in Dev/Test, and then monitor the performance in Prod for a while.
I’ve created a ticket to track these updates. As for the timing, @vysakh-menon-aot, when you have a moment, could you help confirm the timing for when we should do the Prod changes? Thanks.
Test env:
Created a new BEN business with Filing ID: 395939 BEN: BC1152926
Business: BC0883763 FilingId: 152549
When filing an incorporation, the Certificate of Incorporation is occasionally not attached, even though the email says it is. This bug is intermittent: out of 10 tests, the attachment was missing twice.
Happened for a ltd:
Happened for a CCC:
https://test.business.bcregistry.gov.bc.ca/BC1147574/