This is a cloud-function-based webhook processing service that receives webhook calls from RapidPro to capture various analytical data points.
GNU Affero General Public License v3.0
Webhook failure in logging and monitoring system. #373
Describe the bug
Webhook failures are becoming frequent due to lengthy processing times of certain system processes, which are generating error logs in the logging and monitoring dashboard.
Add diagnosis details
When the webhook failures occur, CPU utilization is at its limit of 100%.
The failures occur mainly between 11:30 pm and 2:30 am.
The processes running during this window are mainly dry flows:
Data-sync dry_flow.
Dost program finish flow. (The webhook call was removed from this flow on 18th Apr 2023 because it only updates specific data that the Dost-DataSync-DryFlow updates anyway.)
Expected behavior
The dry flow should be processed correctly without any error logs and webhook failures.
Screenshots
Root cause
The webhooks take too long to complete around midnight. As a result, the Google Cloud Function times out (after 5 minutes) and the webhooks fail before processing completes.
As the number of users in the system has grown, these dry flows have started taking a significant amount of time, delaying subsequent webhook calls until the database is available for new transactions.
Database CPU utilization reaches 100% because of the burst of requests during this window (11:30 pm to 2:30 am).
Additionally, the database tables that store user custom fields and user groups keep growing, which increases execution time. These tables are updated while the dry flow webhooks are processed.
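Since table growth is part of the root cause, one way to keep the live tables small is to move old rows into an archive table in small batches, so that no single transaction holds the database for long. The following is only a minimal sketch using SQLite for illustration; the table names (`user_custom_fields`, `user_custom_fields_archive`), the `created_on` column, and the batch size are assumptions, not the project's actual schema.

```python
import sqlite3


def create_tables(conn):
    """Create the hypothetical live and archive tables used by this sketch."""
    for name in ("user_custom_fields", "user_custom_fields_archive"):
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} "
            "(id INTEGER PRIMARY KEY, user_id INTEGER, created_on TEXT)"
        )
    conn.commit()


def archive_old_rows(conn, cutoff, batch_size=500):
    """Move rows older than `cutoff` (ISO date string) to the archive table.

    Rows are moved in batches, one short transaction per batch, so other
    writers are not blocked for the whole run. Returns the number of rows moved.
    """
    moved = 0
    while True:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM user_custom_fields "
                "WHERE created_on < ? ORDER BY id LIMIT ?",
                (cutoff, batch_size),
            )
        ]
        if not ids:
            return moved
        marks = ",".join("?" * len(ids))
        with conn:  # commits on success, rolls back the batch on error
            conn.execute(
                "INSERT INTO user_custom_fields_archive "
                f"SELECT * FROM user_custom_fields WHERE id IN ({marks})",
                ids,
            )
            conn.execute(
                f"DELETE FROM user_custom_fields WHERE id IN ({marks})", ids
            )
        moved += len(ids)
```

Copying and deleting inside the same transaction means a row is never lost if a batch fails midway; a production database would need its own dialect and an index on the date column.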
Since the webhook failures are happening for the dry flow, here is the 4-step plan to fix this:
[x] Two flows were triggering the dry flow: the program completion flow and the daily sync-up flow. They were essentially doing the same thing, so the webhook call has to be removed from one of them (probably the program completion flow).
[x] Refactor the codebase to handle the dry flow separately.
[x] Archive data from the custom field and groups tables. This is expected to reduce the load by 50%. [For now we are archiving the data manually.]
    [x] Write a query to delete the data.
    [x] Delete the data from the tables.
[x] Add a queue-based webhook URL in the dry_flow.
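The queue-based step in the plan above can be illustrated roughly as follows: the webhook handler only enqueues the payload and acknowledges immediately, while a separate worker performs the slow database work, so the handler never approaches the function timeout. This is a sketch under assumptions, not the service's actual code: the in-memory `queue.Queue` stands in for whatever managed queue the deployment uses, and `handle_webhook`, `process_payload`, and the `contact_id` field are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to shut down


def handle_webhook(q, payload):
    """Accept the webhook quickly: enqueue it and acknowledge with 202."""
    q.put(payload)
    return {"status": "queued"}, 202


def process_payload(payload, results):
    """Placeholder for the slow database update (hypothetical)."""
    results.append(payload["contact_id"])


def worker(q, results):
    """Drain the queue one payload at a time until the sentinel arrives."""
    while True:
        payload = q.get()
        try:
            if payload is _STOP:
                return
            process_payload(payload, results)
        finally:
            q.task_done()


def run_demo(n):
    """Send n fake webhooks through the queue and return what was processed."""
    q = queue.Queue()
    results = []
    t = threading.Thread(target=worker, args=(q, results))
    t.start()
    statuses = [handle_webhook(q, {"contact_id": i})[1] for i in range(n)]
    q.put(_STOP)
    q.join()  # wait until every queued item has been handled
    t.join()
    return statuses, results
```

With a single worker the payloads are processed in arrival order, which also serializes the database writes instead of letting them compete during the midnight burst.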
Impact
Daily: 45,000 - 50,000 webhook failures.
Criticality
Acceptance Criteria
Documentation