CarnegieLearningWeb / UpGrade

Framework for adding A/B testing to education applications
https://www.upgradeplatform.org/
BSD 3-Clause "New" or "Revised" License
25 stars 12 forks source link

BadRequestError #1092

Open danoswaltCL opened 9 months ago

danoswaltCL commented 9 months ago

Version where bug was found: I believe this is only showing up in this way in version 5

Describe the bug In New Relic one can see that we have had a couple instances where calls to api/v1/useraliases from the Java client library in Mathia production (version 3, which can be run and built from release/v3.0.20 branch) resulted in 3 big clusters of around 100+ BadRequestErrors.

image

When this error was first noted in September after the first attempt to push version 5 to production, this BadRequestError seemed to be implicated in server times becoming unusable, but the root cause was unknown, and it seemed that this error was a symptom rather than a cause, and it did not reappear once other steps had been taken, so it was thought that this was thought a red herring / skeptical that this error was accurately representing an actual bad request.

It came up again last week, the timing of the errors makes it not really likely for it to be a reason for the production troubles, it definitely causes the server become unavailable with latency going way up for a little while. The difference now is that we can see the request that is coming in that is getting rejected as a BadRequest: the request body is an empty object:

{
   client_session_id: xdtutmtvhprlbr8o48zlpzxouiw:1698073659781
   endpoint: /api/v1/useraliases
   http_request_id: 6c27b841-faec-4a34-8b32-598f89dd4e42
   level: error
   message: { [[-]](https://splunk.carnegielearning.com/en-US/app/search/search?q=search%20index%3D%22cl_upgrade%22%20BadRequestError&display.page.search.mode=verbose&dispatch.sample_ratio=1&earliest=-7d%40h&latest=now&display.prefs.events.offset=60&sid=1698438186.76541#)
     endPoint: /api/v1/useraliases
     errorCode: 400
     message: Invalid body, check 'errors' property for more info.
     name: BadRequestError
     request: {}
     type: Parameter not in the correct format
   }
   request_method_type: PATCH
}

I asked @jerith666 to help with a theory, I'll just paste that conversation here. Hitting the API directly with postman with an empty object for the request body does produce the correct error, however, I am not able to reproduce how in the world the Java client library could send an empty object to api/v1/useraliases.

hey @Matt McHenry sry to throw something at you on a Friday afternoon, I still think there's something here; is possible for client.setAltIds() to have null values passed in for the parameters in here? https://github.dev.carnegielearning.com/CarnegieLearning/ctp/blob/d1298d078d118c96ac6becf43f[…]k12/launcher/bl/AbstractStudentLaunchSpecBuilder.java The reason I suspect this could be a possibility is that we definitely saw the BadRequestError business come up again back on Monday. https://one.newrelic.com/nr1-core/apm/overview/NzMyNTEzfEFQTXxBUFBMSUNBVElPTnw4MzYyNz[…]end=1698077520000&state=ad3a1eca-5531-0756-7d22-e6169806cfc6 https://one.newrelic.com/nr1-core/errors-inbox/entity-inbox/NzMyNTEzfEFQTXxBUFBMSUNBV[…]end=1698077520000&state=64704018-c039-4a7b-62ea-aedaa808e585 https://one.newrelic.com/nr1-core/errors-inbox/entity-inbox/NzMyNTEzfEFQTXxBUFBMSUNBV[…]end=1698077520000&state=576de03e-0b5a-16c7-dd3c-269291ab811f These produced error logs like this: { [-] client_session_id: xdtutmtvhprlbr8o48zlpzxouiw:1698073659781 endpoint: /api/v1/useraliases http_request_id: 6c27b841-faec-4a34-8b32-598f89dd4e42 level: error message: { [-] endPoint: /api/v1/useraliases errorCode: 400 message: Invalid body, check 'errors' property for more info. name: BadRequestError request: {} type: Parameter not in the correct format } request_method_type: PATCH } That request property is literally an empty object, which is very strange. This is a property I added to our error middleware to log out the request body specifically in response to the first time we got these bad errors. Other errors we see are definitely attaching a correct looking request body, so I looked into our Java Client library to see if it was possible to somehow send an empty object. I think that it technically could, if and only if there were null values passed in. That's the only way I can think that this BadRequestError could have happened. https://github.com/CarnegieLearningWeb/UpGrade/blob/dev/clientlibs/java/src/main/java/org/upgradeplatform/client/ExperimentClient.java#L485 https://github.com/CarnegieLearningWeb/UpGrade/blob/dev/clientlibs/java/src/main/java/org/upgradeplatform/requestbeans/UserAlias.java#L10 (edited)

GitHubGitHub Build software better, together GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 330 million projects. (7 kB) https://github.dev.carnegielearning.com/CarnegieLearning/ctp/blob/d1298d078d118c96ac6becf43f403466dbba5356/runtime/services/services-business/services-k12-launch-bl/src/main/java/cl/services/k12/launcher/bl/AbstractStudentLaunchSpecBuilder.java#L174

Matt McHenry :snail: 4:51 PM I'm pretty sure that can't happen. If there were null CtContextIds in the Stream passed in, then the map(CtContextId::getIdString) would throw an NPE and we'd never even make the call to UpGrade.

Dan Oswalt 4:52 PM how about an empty list?

Matt McHenry :snail: 4:53 PM I think that would generate a request body like { userId: ..., aliases: [] }, not like {} 4:54 but yeah, if a class had an empty syllabus, I think you'd get an empty list of aliases (context ids) for any students in that class. 4:55 I doubt TT lets you create such a class, but roster integration might. 4:55 it'd be super simple to squelch the setAltIds call in that case.

Dan Oswalt 4:56 PM it's just this weird constructor, public UserAlias(){} .... my java skills aren't so great but github copilot seems to think a null userId and empty list would produce an empty object for that class and get passed in as the request body that way https://github.com/CarnegieLearningWeb/UpGrade/blob/dev/clientlibs/java/src/main/java/org/upgradeplatform/requestbeans/UserAlias.java#L10

UserAlias.java public UserAlias(){} https://github.com/[CarnegieLearningWeb/UpGrade](https://github.com/CarnegieLearningWeb/UpGrade)|CarnegieLearningWeb/UpGradeCarnegieLearningWeb/UpGrade | Added by GitHub 4:58 also have no idea if a null userId is possible, but it seemed like a maybe

Matt McHenry :snail: 5:00 PM I would expect you'd need @JsonInclude(Include.NON_NULL) annotations to get that behaviour. that's how it works with Jackson anyway -- I don't know if the client lib is using some other framework? Should be easy to add a test of this in https://github.com/CarnegieLearningWeb/UpGrade/blob/dev/clientlibs/java/src/main/java/org/upgradeplatform/client/Main.java though.

Dan Oswalt 5:01 PM oh right true, i'll try that out

Dan Oswalt 5:13 PM hm can't seem to force it to do so. ok, thx for indulging my theory, have a good weekend

ppratikcr7 commented 9 months ago

@danoswaltCL Just to add a few things. When I try to hit v1/useraliases api call with empty object and also without providing aliases key, I see the below responses:

Screenshot 2023-11-01 at 4 36 44 PM Screenshot 2023-11-01 at 4 34 47 PM
danoswaltCL commented 9 months ago

ty yeah can make it happen manually, but the question is how this could happen from the Java client (v3).

Important to note is that this BadRequestError is only happening in new v5, in v4 this happens:

{
  "type": "Experiment user not defined",
  "httpCode": 404
}

So Mathia has probably been able to send this occassional unexpected empty object via the Java 3 library for a long time, just now in v5 we are calling this a bad request (for good reason, it's not expected and probably should rightfully be rejected instead of 404).

Also important to note though is that now we have an updated version of the Java client in Mathia production that is no longer retrying 4xx errors. So the severity of this error in production should be greatly reduced, as it the effect before was that it would cause Mathia to retry way too many times and flood ourselves with useless traffic.

amurphy-cl commented 8 months ago

Error is occurring roughly 1x/week, 17 errors over the last month.

Spike/acceptance criteria: determine whether this error can be ignored or if it's indicative of problems that should be addressed.