OpenFn / unicef-cambodia

UNICEF Cambodia - Primero Interoperability
https://openfn.github.io/unicef-cambodia/

How to handle Oscar 401 errors? #61

Closed aleksa-krolls closed 3 years ago

aleksa-krolls commented 3 years ago

Background

Oscar is a custom case management application that we need to access hourly to sync case data with the UNICEF Primero system. When we try to authenticate, we intermittently hit a 401 Invalid login details error (see example), which causes the full data sync flow to fail even though nothing has changed about the credentials or auth pattern used. When I re-run the failed data flow, I typically hit the same error repeatedly for the next ~5-10 minutes... then something seems to reset on the Oscar side and it starts working again with no issues. From Rotati...
If you encounter 401 error please re-authenticate again or just as you did you just resend the payload again. There is nothing wrong with the authentication. It is because we using multitenant applications. It is a complex explanation if i try to describe but we are going to solve this issue very soon.

Filter Activity History by the Job: Oscar API Endpoint Monitor to see the pattern of when Oscar authentication fails/ succeeds.

Error: Server responded with:  
{
  "statusCode": 401,
  "body": {
    "success": false,
    "errors": [
      "Invalid login credentials. Please try again."
    ]
  },

The request

The Oscar dev team hasn't resolved this issue, so I'm wondering if we should think through a retry mechanism on the OpenFn side to re-send the request after ~5 min if we encounter such an error. What would this retry solution look like? And what would be the LOE (# hours) to implement? (Consider that it might be hard to find budget here, but I'm worried about how this will affect the integrity of the whole solution.)

TD's initial thoughts: We could (a) extend the platform to allow multiple triggers per job (exciting!) or (b) create a duplicate job that’s kept in sync via GitHub.

taylordowns2000 commented 3 years ago

Questions for this one @aleksa-krolls :

  1. Is it only on the post to /api/v1/auth/sign_in that we get the error, or do we also get it when hitting the /api/v1/organizations/clients/upsert/ endpoint with the auth token we got from sign_in?
  2. This strikes me as critical for f1-j2, as we could lose state since we're not allowed to persist any data for retries. It does not strike me as critical for f2-j1 as it will get retried in 30 minutes. Is that the right understanding?
  3. This is just a thought: what if we simply caught the 401 inline and retried right there? This now feels cleaner than keeping another job in sync. A good old-fashioned try/catch might do, if we controlled the initial request inside execute(...) by hand.
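
A minimal sketch of the inline retry idea in point 3, in plain JavaScript. Everything named here is an assumption for illustration: `retryOn401` is a hypothetical helper (not part of any OpenFn adaptor), the failing request is assumed to reject with an error carrying `statusCode === 401`, and the default of 3 attempts spaced 5 minutes apart is only a guess based on the ~5-10 minute failure window described above.

```javascript
// Hypothetical helper: re-run an async request when it fails with a 401.
// `request` is any function that returns a promise. An error whose
// `statusCode` is 401 is treated as the transient Oscar auth failure;
// any other error is re-thrown immediately.
async function retryOn401(request, { attempts = 3, delayMs = 5 * 60 * 1000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await request();
    } catch (error) {
      const retriable =
        error && error.statusCode === 401 && attempt < attempts;
      if (!retriable) throw error; // not a 401, or no attempts left
      // Wait before trying again, since the 401s seem to clear on their own.
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Usage (signIn is a placeholder for whatever function posts to
// /api/v1/auth/sign_in and resolves with the auth token):
//
//   const token = await retryOn401(() => signIn(credentials));
```

Because the wait happens inside the same run, nothing has to be persisted between attempts. Whether a run is allowed to stay open for that long on the platform is not confirmed here, so the delay and attempt count would need tuning.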

aleksa-krolls commented 3 years ago

@taylordowns2000 @lakhassane Thanks for the thoughtful questions and input... it's looking like Oscar may have resolved the related issues on their side! But it's helpful to have a basic understanding of potential ways forward. Will keep you posted.