Open mahalakshme opened 12 months ago
@petmongrels @vinayvenu The number of incomplete syncs is currently huge every day. For example, apfodisha had 108 yesterday, yet Bugsnag shows no more than 30 failed syncs for the whole of last week. The discrepancy is too large to be explained by users discontinuing the sync, so I am not sure Bugsnag is showing all the issues users face. To start with, I feel we can add the error_message to understand what kinds of errors we get, and move to status codes later. This will also help in quickly categorising the errors: from the Bugsnag logs, I see that if we use status codes, the majority will fall under the "Unknown error happened" category, which will not help us resolve issues quickly. Having the error message might help us see which errors occur most frequently and prioritise them. Once we have clarity on the kinds of error messages users face, we can move to status codes. Let me know if you feel differently. cc: @arjunk
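To illustrate the kind of categorisation the error message would allow, here is a minimal sketch. It assumes telemetry rows are available as dicts; the key names `sync_status` and `error_message`, the helper name, and the sample messages are all hypothetical and only for illustration.

```python
from collections import Counter

def top_error_messages(rows, limit=5):
    """Count error messages across incomplete syncs, most frequent first.

    `rows` is an iterable of dicts with the hypothetical keys
    'sync_status' and 'error_message'.
    """
    counts = Counter()
    for row in rows:
        if row.get("sync_status") != "incomplete":
            continue
        message = (row.get("error_message") or "").strip()
        # Incomplete syncs without a message are counted separately,
        # since they may simply be syncs the user discontinued.
        counts[message or "<no error message>"] += 1
    return counts.most_common(limit)

# Illustrative rows only, not real telemetry.
sample = [
    {"sync_status": "incomplete", "error_message": "Network request failed"},
    {"sync_status": "incomplete", "error_message": None},
    {"sync_status": "incomplete", "error_message": "Network request failed"},
    {"sync_status": "complete", "error_message": None},
    {"sync_status": "incomplete", "error_message": "Network request failed"},
    {"sync_status": "incomplete", "error_message": ""},
]

for message, count in top_error_messages(sample):
    print(count, message)
```

A ranking like this would show at a glance which messages dominate, which is harder to get if most failures collapse into a single status code.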
@mahalakshme Agree with what needs to be done. One way to make this work is to move the wider columns (entity_status, device_info and the new column we plan to add) to a different table - sync_telemetry_details. That way our regular status queries and filters will work faster and we need to go to the details only when required. The error message alone can stay in the sync_telemetry table. Needs a spike to prove this theory.
This will mean changes to ETL as well.
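A rough sketch of the split described above, for the spike. SQLite is used here only to keep the example self-contained; the real database, the `id`, `sync_status` and `sync_start_time` columns, and the helper names are assumptions. Only `entity_status`, `device_info`, `error_message` and the two table names come from this thread.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Narrow table: what the regular status queries and filters touch.
CREATE TABLE sync_telemetry (
    id INTEGER PRIMARY KEY,
    sync_status TEXT NOT NULL,
    sync_start_time TEXT NOT NULL,
    error_message TEXT
);
-- Wide columns move here and are read only when required.
CREATE TABLE sync_telemetry_details (
    sync_telemetry_id INTEGER PRIMARY KEY REFERENCES sync_telemetry(id),
    entity_status TEXT,
    device_info TEXT
);
""")

def count_by_status(conn, start, end):
    """Status counts for a date range, using the narrow table only."""
    rows = conn.execute(
        "SELECT sync_status, COUNT(*) FROM sync_telemetry "
        "WHERE sync_start_time >= ? AND sync_start_time < ? "
        "GROUP BY sync_status",
        (start, end),
    ).fetchall()
    return dict(rows)

def load_details(conn, telemetry_id):
    """Fetch the wide columns for one sync, or None if no details row exists."""
    row = conn.execute(
        "SELECT entity_status, device_info FROM sync_telemetry_details "
        "WHERE sync_telemetry_id = ?",
        (telemetry_id,),
    ).fetchone()
    if row is None:
        return None
    return {"entity_status": json.loads(row[0]), "device_info": json.loads(row[1])}

# Illustrative rows only.
conn.execute("INSERT INTO sync_telemetry VALUES (1, 'complete', '2024-01-01T09:00:00', NULL)")
conn.execute("INSERT INTO sync_telemetry VALUES (2, 'incomplete', '2024-01-01T10:00:00', 'Network request failed')")
conn.execute(
    "INSERT INTO sync_telemetry_details VALUES (?, ?, ?)",
    (2, json.dumps({"pending": 3}), json.dumps({"model": "example"})),
)

print(count_by_status(conn, "2024-01-01", "2024-01-02"))
print(load_details(conn, 2))
```

Whether this is actually faster on the production data is exactly what the spike would need to measure.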
@vinayvenu https://stackoverflow.com/questions/26555797/does-number-of-columns-in-a-table-affect-the-performance-of-a-count-query-on - yeah, that might work. But I feel we need not club it with this card, since I don't see the query taking much time (less than 10s for 1.5 months of data) when we filter by dates. I thought we could have a sync_failure_telemetry table, with entries made only on sync_failed status. Considering we already have too many entries, I thought we need a separate table here to avoid empty cells. Let me know if you think otherwise; we can discuss over a call.
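A minimal sketch of the sync_failure_telemetry idea, assuming a row is written there only when the status is sync_failed. SQLite is used just to keep it self-contained; the column names and the `record_sync` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sync_telemetry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sync_status TEXT NOT NULL
);
-- One row per failed sync, so successful syncs carry no empty error cells.
CREATE TABLE sync_failure_telemetry (
    sync_telemetry_id INTEGER PRIMARY KEY REFERENCES sync_telemetry(id),
    error_message TEXT NOT NULL
);
""")

def record_sync(conn, sync_status, error_message=None):
    """Insert a telemetry row; add a failure row only for sync_failed."""
    cursor = conn.execute(
        "INSERT INTO sync_telemetry (sync_status) VALUES (?)", (sync_status,)
    )
    telemetry_id = cursor.lastrowid
    if sync_status == "sync_failed":
        conn.execute(
            "INSERT INTO sync_failure_telemetry VALUES (?, ?)",
            (telemetry_id, error_message or "unknown"),
        )
    return telemetry_id

# Illustrative entries only.
record_sync(conn, "complete")
record_sync(conn, "incomplete")
failed_id = record_sync(conn, "sync_failed", "Network request failed")

print(conn.execute("SELECT COUNT(*) FROM sync_telemetry").fetchone()[0])
print(conn.execute("SELECT * FROM sync_failure_telemetry").fetchall())
```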
Not very convinced. Let's talk about the solution in the Monday call.
As a developer/implementation engineer/analyst, I want to know whether an incomplete sync was caused by a sync failure. Currently, even when a user closes the app while a sync is in progress (say, because they get a call), sync_telemetry records the status as incomplete. It is also marked incomplete when there is an error message. It would also help to know the reason for the failure.
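The distinction being asked for could look roughly like the sketch below. The labels "failed" and "abandoned" and the function name are hypothetical; the only assumption taken from the description is that an error message signals a real failure, while an incomplete sync without one may just have been interrupted by the user.

```python
def classify_sync(sync_status, error_message):
    """Split 'incomplete' into failed vs. abandoned, based on the error message."""
    if sync_status == "complete":
        return "complete"
    if error_message and error_message.strip():
        # An error was captured, so treat this as a genuine sync failure.
        return "failed"
    # No error captured: possibly the user closed the app mid-sync.
    return "abandoned"

print(classify_sync("incomplete", "Network request failed"))
print(classify_sync("incomplete", None))
```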
Acceptance criteria:
Why important:
In every release we play cards that touch major sync areas, but we currently don't have a good enough mechanism to track down sync failures quickly.
Out of scope:
Inputs: