Plant-for-the-Planet-org / FireAlert

Monitor Forest Fires with FireAlert
https://www.plant-for-the-planet.org/firealert

Unoptimized siteAlert fetching and processing mechanism and unscalable notification creation and sending #146

Closed dhakalaashish closed 9 months ago

dhakalaashish commented 9 months ago

Bugs and their Solutions

  1. Bug Description: geo-event-fetcher has createGeoEvents and createSiteAlerts. The logic is set up such that whenever createGeoEvents runs, createSiteAlerts also runs, regardless of whether new geoEvents are present. This is computationally demanding and expensive for the database.

    Originally, this was set up because we had assumed that there might be unprocessed geoEvents which need processing. However, it is redundant: whenever new geoEvents appear for that slice (which they eventually will), the run automatically processes any old unprocessed geoEvents.

    Solution: Update the geo-event-fetcher cronjob to run the createSiteAlerts function only when there are new geoEvents, as sketched below.
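
    A minimal sketch of this guard in TypeScript (assuming, hypothetically, that createGeoEvents returns the geoEvents it created and that both functions take the provider being processed):

      const newGeoEvents = await createGeoEvents(provider);
      // Skip alert creation entirely when this run fetched nothing new;
      // any old unprocessed geoEvents will be picked up by a later run
      // that does fetch new events for this slice.
      if (newGeoEvents.length > 0) {
        await createSiteAlerts(provider);
      }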

  2. Bug Description: None of the cronjobs close the Prisma connection after they run. Solution: disconnect the Prisma client after each cronjob run, for example as sketched below.
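
    A minimal sketch in a Next.js API-route cronjob handler (the import path and runCronjob are illustrative; prisma is assumed to be the shared PrismaClient instance):

      import type { NextApiRequest, NextApiResponse } from 'next';
      import { prisma } from '../../server/db'; // path is illustrative

      export default async function handler(req: NextApiRequest, res: NextApiResponse) {
        try {
          await runCronjob(); // hypothetical stand-in for the job's work
          res.status(200).json({ ok: true });
        } catch (error) {
          res.status(500).json({ error: String(error) });
        } finally {
          // Always release the connection, even when the job throws.
          await prisma.$disconnect();
        }
      }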

  3. Bug Description: createSiteAlerts currently handles Polygon/Point sites and MultiPolygon sites sequentially, which has increased the execution time of the first SQL statement. The second SQL statement in the createSiteAlerts function then marks all geoEvents that were processed as isProcessed=true. When the first SQL times out, the cronjob, despite having created siteAlerts, never marks the geoEvents as processed.

    Solution

    • Update CreateSiteAlerts Service:
      • The SQL should join with geoEvents for the current date only.
      • The first SQL must be split into two statements (one for Point/Polygon sites and the other for MultiPolygon sites), and they must run concurrently. This cannot cause race conditions because the write operations of the two statements are completely separate from each other; see the sketch below.
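
    A sketch of the concurrent split, assuming the two halves of the first SQL are wrapped in the hypothetical helpers below, each restricted to the current date's geoEvents:

      // The two statements write to disjoint rows, so running them in
      // parallel is safe.
      const [pointPolygonCount, multiPolygonCount] = await Promise.all([
        createSiteAlertsForPointAndPolygonSites(), // first half of SQL 1
        createSiteAlertsForMultiPolygonSites(),    // second half of SQL 1
      ]);
      // The second SQL runs only after both halves succeed, so geoEvents
      // are never marked processed while their siteAlerts are missing.
      await markGeoEventsAsProcessed();
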
  4. Bug Description: createNotifications wraps the entire database process in a single Prisma transaction. This normally runs smoothly (there are usually only about 50 new alerts), but when there are more alerts, or when new siteAlerts somehow fail to get processed and unprocessed alerts keep stacking up, the transaction times out. Moreover, the function fetches all unprocessed siteAlerts into a single array, so once the dataset grows large enough there will be memory issues. Additionally, the error handling only console-logs the error rather than returning an error status; since the totalNotificationCreated variable is initialized to 0, a return of 0 notifications created conceals the fact that any errors occurred.

    Solution

    • Update CreateNotifications Service:

      • For redundant alerts: only fetch siteAlerts from within the last 24 hours.
      • For the timeout error: ensure the Prisma transaction covers only the parts that require atomicity. Only these operations must run in a transaction:
        • processing siteAlerts,
        • creating notifications, and
        • setting lastMessageCreated for sites.
      • For memory optimization: Ensure that we fetch and process siteAlerts in chunks, while ensuring that the siteAlerts for a single site are never split across different chunks. Pseudo code (a TypeScript sketch follows at the end of this list):
      
      Fetch all unprocessed siteAlerts, ordered by siteId and eventDate ascending
      Initialize chunkedSiteAlertsNestedArray as a two-level nested array
      currentSiteId = siteId of the first siteAlert
      nestedIndex = 0
      For each unprocessed siteAlert:
              If the chunk at nestedIndex is not yet initialized, initialize it as an empty array
              Check whether this siteAlert belongs to a new site (its siteId differs from currentSiteId)
              If it is a new site AND the current chunk already has more than 80 elements:
                  Increment nestedIndex
                  Start a new chunk and add the current siteAlert to it
              Else:
                  Push the siteAlert into the current chunk
              Update currentSiteId to this siteAlert's siteId

      For each chunk in chunkedSiteAlertsNestedArray:
          Do the notificationCreatorLogic
          Then create notifications, update siteAlerts, and update sites in a single transaction
      • For error handling: Currently, the error is only console-logged. Errors must be handled by the API by returning a 500 error status with the error message. The error must also be logged in Better Stack.
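
    A TypeScript sketch of the chunked processing described in the pseudocode above (buildNotifications is a hypothetical stand-in for the notificationCreatorLogic; the siteAlert and notification model names follow the issue text):

      const last24Hours = new Date(Date.now() - 24 * 60 * 60 * 1000);
      const alerts = await prisma.siteAlert.findMany({
        where: { isProcessed: false, eventDate: { gte: last24Hours } },
        orderBy: [{ siteId: 'asc' }, { eventDate: 'asc' }],
      });

      const CHUNK_SIZE = 80;
      const chunks: (typeof alerts)[] = alerts.length > 0 ? [[]] : [];
      let currentSiteId = alerts[0]?.siteId;

      for (const alert of alerts) {
        const lastChunk = chunks[chunks.length - 1];
        // Open a new chunk only at a site boundary, so that one site's
        // alerts never straddle two chunks.
        if (alert.siteId !== currentSiteId && lastChunk.length > CHUNK_SIZE) {
          chunks.push([alert]);
        } else {
          lastChunk.push(alert);
        }
        currentSiteId = alert.siteId;
      }

      for (const chunk of chunks) {
        const notifications = buildNotifications(chunk); // hypothetical
        // Only the writes that need atomicity go inside the transaction.
        await prisma.$transaction([
          prisma.notification.createMany({ data: notifications }),
          prisma.siteAlert.updateMany({
            where: { id: { in: chunk.map((a) => a.id) } },
            data: { isProcessed: true },
          }),
          // ...plus the sites' lastMessageCreated update
        ]);
      }
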
  5. Bug Description: The n8n workflow for fetching and processing geoEvents, creating siteAlerts, and creating and sending notifications is all set up as a single workflow. If any earlier node fails, none of the subsequent nodes get invoked. And since the workflow times out after 5 minutes, the createNotification and sendNotification nodes may never run even when there are no errors. This is very likely because the generate geo events node alone has a timeout of 5 minutes.

    [Screenshot of the current single n8n workflow: n8n.png]

    Solution:

    • Solution a: Separate the create notification and send notification nodes into an independent workflow that runs every 30 minutes. At the start, create notification can stop if there are no unprocessed siteAlerts, thereby ending the entire workflow early (see the guard sketched below).
    • Solution b: Currently the generate geo events node runs in a loop until there are no providers left to process. Since the workflow times out after 5 minutes, it is cancelled automatically, resulting in an error. Limiting the loop to a maximum of, say, 5 iterations can mitigate this somewhat.
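
    A sketch of the early-exit guard at the top of the separated create-notification workflow (handler shape and model name are assumptions):

      const pending = await prisma.siteAlert.count({
        where: { isProcessed: false },
      });
      if (pending === 0) {
        // Nothing to process: end the 30-minute workflow cheaply.
        return res.status(200).json({ message: 'No unprocessed siteAlerts' });
      }
      await createNotifications();
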
  6. Bug Description: The sendNotification function updates isDelivered=true concurrently for all 20 notifications, issuing one database write per notification. This is very inefficient, as it nearly exceeds the connection limit of the database.

    Solution: Gather the notification_ids of the notifications whose delivery succeeded and bulk-update them at the end. Similarly, bulk-update the alertMethods' failure counts, as sketched below.
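
    A sketch of the bulk update, assuming sendNotification reports whether delivery succeeded and that the models are named notification and alertMethod with a failCount field (all assumptions):

      const deliveredIds: string[] = [];
      const failedAlertMethodIds: string[] = [];

      // notifications is the batch of up to 20 undelivered notifications.
      for (const notification of notifications) {
        const delivered = await sendNotification(notification);
        if (delivered) deliveredIds.push(notification.id);
        else failedAlertMethodIds.push(notification.alertMethodId);
      }

      // Two writes in total instead of one write per notification.
      await prisma.notification.updateMany({
        where: { id: { in: deliveredIds } },
        data: { isDelivered: true },
      });
      await prisma.alertMethod.updateMany({
        where: { id: { in: failedAlertMethodIds } },
        data: { failCount: { increment: 1 } },
      });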

dhakalaashish commented 9 months ago

#144 solves Bugs 1, 4 and 6

#145 returns a success count for sendNotifications instead of just a success boolean (not exactly a bug, but it was unclean code)