Bug Description: geo-event-fetcher has createGeoEvents and createSiteAlerts. The logic is set up such that whenever createGeoEvents runs, createSiteAlerts also runs, regardless of whether new geoEvents are present or not. This is computationally very demanding and expensive for the database.
Originally, this was set up because we had assumed that there might be unprocessed geoEvents which need processing. However, it is redundant, because whenever new geoEvents appear for that slice (which they will later), the run will automatically process the old unprocessed geoEvents (if any).
Solution:
Update the geo-event-fetcher cronjob to only run the createSiteAlerts function if there are any new geoEvents.
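A minimal sketch of this change, assuming createGeoEvents can report how many new geoEvents it stored; the surrounding function and signatures are illustrative, not the actual service code:

```typescript
// Sketch only: the service signatures here are assumptions about the existing code.
async function runGeoEventFetcher(
  providerId: string,
  createGeoEvents: (providerId: string) => Promise<number>, // returns count of newly stored geoEvents
  createSiteAlerts: (providerId: string) => Promise<void>,
) {
  const newGeoEventCount = await createGeoEvents(providerId);

  // Skip the expensive site-alert pass when nothing new came in; any old
  // unprocessed geoEvents will be handled by the next run that does produce
  // new events for this slice.
  if (newGeoEventCount > 0) {
    await createSiteAlerts(providerId);
  }
}
```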
Bug Description: None of the cronjobs close the Prisma connection after they run.
Solution: disconnect the Prisma connection after each cronjob run.
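A minimal sketch of the disconnect pattern, using Prisma's $disconnect in a finally block so the connection is released even when the job throws (the wrapper name is illustrative):

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Wrap each cronjob body so the connection is always closed at the end of the run.
async function runCronjob(job: (prisma: PrismaClient) => Promise<void>) {
  try {
    await job(prisma);
  } finally {
    await prisma.$disconnect();
  }
}
```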
Bug Description: createSiteAlerts currently handles Polygon || Point sites and MultiPolygon sites sequentially, which has increased the execution time of the first SQL. The second SQL in the createSiteAlerts function then marks all geoEvents that were processed as isProcessed=true. When the first SQL times out, the cronjob, despite having created siteAlerts, does not mark the geoEvents as processed.
Solution
Update CreateSiteAlerts Service:
The SQL should join with geoEvents for the current date only.
The first SQL must be divided into two SQLs (one for Point-Polygon sites and the other for MultiPolygon sites) and they must run concurrently. We will not run into race conditions, as the write operations of these SQLs are completely separate from each other.
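A sketch of the concurrency change, with the two raw INSERT … SELECT statements abstracted behind hypothetical wrapper functions (the real SQL lives in the createSiteAlerts service):

```typescript
import { PrismaClient } from "@prisma/client";

// Hypothetical wrappers around the two raw SQL statements; each returns the
// number of siteAlerts it inserted.
type RawAlertInsert = (prisma: PrismaClient) => Promise<number>;

async function createSiteAlertsConcurrently(
  prisma: PrismaClient,
  insertAlertsForPointAndPolygonSites: RawAlertInsert,
  insertAlertsForMultiPolygonSites: RawAlertInsert,
) {
  // The two statements write alerts for disjoint sets of sites
  // (Point/Polygon vs. MultiPolygon), so they cannot race and may run in parallel.
  const [pointPolygonCount, multiPolygonCount] = await Promise.all([
    insertAlertsForPointAndPolygonSites(prisma),
    insertAlertsForMultiPolygonSites(prisma),
  ]);
  return pointPolygonCount + multiPolygonCount;
}
```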
Bug Description: createNotifications wraps the entire database process in a single Prisma transaction. This normally runs smoothly (there are usually only around 50 new alerts), but when there are more alerts, or new siteAlerts somehow do not get processed and unprocessed alerts keep stacking up, the transaction times out. Moreover, because we fetch all unprocessed siteAlerts into a single array, a large enough dataset will cause memory issues. Additionally, the error handling only console-logs the error rather than returning an error status. Since the totalNotificationCreated variable is initialized to 0, a return of 0 notifications created hides the fact that any errors occurred.
Solution
Update CreateNotifications Service:
For redundant alerts: only fetch siteAlerts that are within the last 24 hours.
For the timeout error: ensure the Prisma transaction is applied only to the parts that require atomicity. So, only these operations must be in a transaction:
processing siteAlerts,
creating notifications, and
setting lastMessageCreated for sites.
For memory optimization: ensure that we fetch and process siteAlerts in chunks. However, the siteAlerts for a single site must not be split across different chunks.
Pseudo Code:
Fetch all unprocessed siteAlerts, ordered by siteId and eventDate ascending
Initialize chunkedSiteAlertsNestedArray as a two-level nested array
currentSiteId = siteId of the first siteAlert
let nestedIndex = 0
For each unprocessed siteAlert:
    Initialize a new array inside the main array, if not already initialized
    Check whether the siteAlert has a new siteId
    If it has a new siteId and the current nested chunk already has more than 80 elements:
        Increment the nestedIndex count
        Update the currentSiteId
        Start a new chunk and add the current siteAlert to that new chunk
    Else:
        Push the siteAlert into the current chunk
For each chunk in chunkedSiteAlertsNestedArray:
    Run the notificationCreatorLogic
    Then create notifications, update siteAlerts, and update sites in a transaction
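A TypeScript sketch of the chunking above; the SiteAlert shape and the chunk size of 80 follow the pseudo code, while the type and function names are illustrative:

```typescript
// Alerts are expected to arrive already ordered by siteId, then eventDate.
interface UnprocessedSiteAlert {
  id: string;
  siteId: string;
  eventDate: Date;
}

function chunkSiteAlertsBySite(
  siteAlerts: UnprocessedSiteAlert[],
  chunkSize = 80,
): UnprocessedSiteAlert[][] {
  const chunks: UnprocessedSiteAlert[][] = [];
  let currentSiteId: string | undefined;

  for (const alert of siteAlerts) {
    const currentChunk = chunks[chunks.length - 1];
    const isNewSiteId = alert.siteId !== currentSiteId;

    // Only start a new chunk on a site boundary, and only once the current
    // chunk is already over the size limit, so one site's alerts are never
    // split across chunks.
    if (!currentChunk || (isNewSiteId && currentChunk.length > chunkSize)) {
      chunks.push([alert]);
    } else {
      currentChunk.push(alert);
    }
    currentSiteId = alert.siteId;
  }
  return chunks;
}
```

Each chunk can then be run through the notificationCreatorLogic, with only the notification creation, siteAlert update, and site lastMessageCreated update wrapped in a prisma.$transaction, keeping each transaction small.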
For error handling: Currently, the error is only logged to the console. Errors must be handled by the API by returning a 500 error status with the error message. The error must also be logged in Better Stack.
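A sketch of the error handling, assuming a Next.js API-route style cron handler and the Better Stack (Logtail) Node client; the createNotifications import path and the environment variable name are assumptions:

```typescript
import { Logtail } from "@logtail/node";
import type { NextApiRequest, NextApiResponse } from "next";
import { createNotifications } from "../../Services/Notifications/CreateNotifications"; // illustrative path

const logtail = new Logtail(process.env.BETTER_STACK_SOURCE_TOKEN ?? "");

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  try {
    const notificationsCreated = await createNotifications();
    res.status(200).json({ notificationsCreated });
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    // Log to Better Stack and surface the failure instead of silently
    // returning 0 notifications created.
    await logtail.error(`createNotifications failed: ${message}`);
    res.status(500).json({ message });
  }
}
```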
Bug Description: The n8n workflow for fetching and processing geoEvents, creating siteAlerts, creating notifications, and sending notifications is all set up as a single workflow. If any earlier node fails, all subsequent ones are never invoked.
And since the workflow times out after 5 minutes, even if there are no errors there is a chance that the createNotification and sendNotification nodes never run. This is very likely because the generate geo events node itself has a timeout duration of 5 minutes.
Solution:
Solution a: Separate the create notification and send notification nodes into an independent workflow that runs every 30 minutes. At the beginning, create notification can stop if there are no unprocessed siteAlerts, and thereby the entire workflow can stop.
Solution b: Currently the generate geo events node runs in a loop until there are no providers left to process. Since the workflow times out after 5 minutes, it is cancelled automatically at that point, resulting in an error. Limiting the loop to a maximum of, say, 5 iterations can mitigate this issue somewhat.
Bug Description: The sendNotification function updates isDelivered=true concurrently for all 20 notifications. This is very inefficient, as it nearly exceeds the database connection limit.
Solution: Gather the notification_ids of the notifications whose isDelivered became true, and bulk-update them at the end. Similarly, bulk-update the failure count on the alertMethods.
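A sketch of the batched writes, assuming the delivery loop collects ids instead of updating each row individually; the Prisma model and field names (notification, alertMethod, failCount) are assumptions based on the description above:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function finalizeDeliveries(
  deliveredNotificationIds: string[],
  failedAlertMethodIds: string[],
) {
  // One statement marks every successfully delivered notification...
  await prisma.notification.updateMany({
    where: { id: { in: deliveredNotificationIds } },
    data: { isDelivered: true },
  });

  // ...and one statement bumps the failure count on every alertMethod that failed.
  await prisma.alertMethod.updateMany({
    where: { id: { in: failedAlertMethodIds } },
    data: { failCount: { increment: 1 } },
  });
}
```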