DostEducation / RP_IVR_analytics

This is a cloud-function-based webhook processing service that is used to receive webhook calls from RapidPro and capture different analytical data points.
GNU Affero General Public License v3.0

Discrepancy in the database and RapidPro for user program’s data #463

Closed Sachinbisht27 closed 5 months ago

Sachinbisht27 commented 6 months ago

Describe the bug We have observed that some RP_IVR users are getting regular calls delayed by 4-5 hours. There are around 750-800 users who are getting regular calls delayed in RP_IVR.

Add diagnosis details The users are getting their calls at their preferred time slot, but because the time slot stored in the database differs from it, the charts and the alerts for delayed calls show discrepancies.

Expected behavior

Screenshots Weekly monitoring charts: image

User program status in the database: image

User program details from RapidPro: image

Root cause

For example: https://rapidpro.ilhasoft.in/contact/read/f023ebe5-0c40-4ded-a7fe-67db654a9970/

Impact

Action items

Criticality Medium

Acceptance Criteria

Documentation Add whatever documentation will be required here.

Satendra-SR commented 6 months ago

@kritirakheja are we going with a solution at RapidPro end? Need confirmation

Sachinbisht27 commented 6 months ago

Backfilling for the correct user time slot is completed.

kritirakheja commented 6 months ago

We will be fixing this at the RapidPro level. It has been picked up in this sprint.

Sachinbisht27 commented 5 months ago

Closing the issue as completed and fixed.

Sachinbisht27 commented 1 month ago

Details on the extra effort spent on this issue:

This ticket concerns a mismatch between the program details in RapidPro and the database. Before the diagnosis and backfilling, the data suggested that users were not getting calls at the correct time slots.

We worked on the diagnosis and found that the users were getting calls at the correct time slots, but the time slot details in the database were not in sync with RapidPro. We went on to look for the cause in both the RapidPro flows and the application. We tried to replicate the problem ourselves and found that it occurred only in the RapidPro flows. We communicated this to the team and completed the backfilling of the user program and time slot details.

Work done to complete this issue:

  1. To replicate the issue and find the root cause, we followed these steps (14 hours):

    • [x] Performed testing to replicate the same behaviour on the local machine.
    • [x] Identified the flow that was causing the discrepancy.
    • [x] Found the issue in the flow FINAL-DOST-Pilot-OnboardingCall1.2 and shared this with the team on 2nd Jan.
    • [x] Extracted the list of the users impacted through this.
    • [x] Extracted and analysed the reports from KooKoo for new registration calls.
    • [x] Analysed the behaviour of the impacted users in the RapidPro history.
    • [x] Extracted the RapidPro data and the database data for the users.
    • [x] Compared the RapidPro details of the impacted users with the database.
    • [x] Added the analysis to the GitHub description.
  2. To fix the issue, we followed these steps (4 hours):

    • [x] We completed the backfilling for the impacted users in the database on the 4th of Jan, as mentioned here: https://github.com/DostEducation/RP_IVR_analytics/issues/463#issuecomment-1876563920
    • [x] We made the required fixes to the RapidPro flow after the 4th of Jan.
    • [x] Because the fixes to the RapidPro flow were made after the backfilling, a few more data discrepancies occurred later; these were backfilled again using the same process.
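
The comparison and backfilling steps above can be sketched roughly as follows. This is only an illustration, not the script that was actually run: the function names, the user IDs, and the slot labels are hypothetical, and it assumes the RapidPro and database extracts have each been reduced to a mapping of user ID to time slot. RapidPro is treated as the source of truth because the users were receiving calls at the correct time.

```python
from typing import Dict, List, Tuple


def find_slot_mismatches(
    rapidpro_slots: Dict[str, str], db_slots: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (user_id, rapidpro_slot, db_slot) for users whose slots differ."""
    mismatches = []
    for user_id, rp_slot in rapidpro_slots.items():
        db_slot = db_slots.get(user_id)
        # Users missing from the database extract are skipped here.
        if db_slot is not None and db_slot != rp_slot:
            mismatches.append((user_id, rp_slot, db_slot))
    return sorted(mismatches)


def build_backfill(mismatches: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map each impacted user to the RapidPro slot to write to the database."""
    return {user_id: rp_slot for user_id, rp_slot, _ in mismatches}


# Hypothetical sample data, for illustration only.
rapidpro = {"u1": "MORNING", "u2": "EVENING", "u3": "AFTERNOON"}
database = {"u1": "MORNING", "u2": "MORNING", "u3": "AFTERNOON"}

impacted = find_slot_mismatches(rapidpro, database)
backfill = build_backfill(impacted)
print(impacted)
print(backfill)
```

Re-running the same comparison after the flow fix would surface any discrepancies created between the first backfill and the fix, which matches the second round of backfilling described above.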

Sachinbisht27 commented 1 month ago

Explanation of the effort spent:

  1. The issue was related to a discrepancy between the database and RapidPro. We have encountered similar issues in the past, and based on that experience we estimated that a total of 8 hours would be enough for diagnosis and fixing (4+4).
  2. In past cases, the issue had been on the application side, which we could easily diagnose. This issue turned out to originate from RapidPro, so it took us an additional 10 hours to replicate the behaviour and understand the root cause.
  3. Since the issue was on the RapidPro side, it was not possible to anticipate it beforehand. The root cause and findings were logged on GitHub on the same day.
  4. The fix was completed within the original estimate of 4 hours.
  5. Here too, we were tracking these hours regularly on the same day in the ColoredCow timesheet but missed updating them in GitHub. I verified the entries for the same dates in the CC timesheet via the Google Sheet history.