ThreeMammals / Ocelot

.NET API Gateway
https://www.nuget.org/packages/Ocelot
MIT License
8.24k stars 1.62k forks source link

#2110 Fix for race condition on fetching the list of services in Kubernetes service discovery provider #2111

Open antikorol opened 2 weeks ago

antikorol commented 2 weeks ago

Fixes #2110

Proposed Changes

Move the creation of the services list from the class field to the method, to prevent list modification from different threads, which caused the partially modified list of services to be returned.

antikorol commented 2 weeks ago

It appears that the significant concurrency issue in the Kube class mentioned in the #2110 (comment) has been resolved, yet a genuine fixing solution for the RoundRobin class is still required.

Please note that our development process necessitates both unit and acceptance testing. It is advisable to replicate the issue with an acceptance test before implementing the bug-fixing solution. We need to replicate the bug using an acceptance test. I can draft a preliminary version of the test, and you can then refine it to create the final version.

@raman-m I'm not quite sure which fix you are referring to. In this case, there will no longer be a List of the form [service, null] because the issue with modifying the List of services from different threads has been fixed. Or do you mean adding a check that if the service or the HostAndPort property is null, then log the error and return a 404 status code for that request?

raman-m commented 6 days ago

Or do you mean adding a check that if the service or the HostAndPort property is null, then log the error and return a 404 status code for that request?

Yes, I do. Could you include this check within the RoundRobin class plz? Thank you for adding the unit test. I'm assessing the efficiency of writing an acceptance test, which is expected to be quite complicated. Have you tested the bug fix in your environment?

antikorol commented 6 days ago

Or do you mean adding a check that if the service or the HostAndPort property is null, then log the error and return a 404 status code for that request?

Yes, I do. Could you include this check within the RoundRobin class plz? Thank you for adding the unit test. I'm assessing the efficiency of writing an acceptance test, which is expected to be quite complicated. Have you tested the bug fix in your environment?

Added new error to handle this case when service or HostAndPort is null. The fix has been tested in dev/preprod environments, and the fix was checked with load testing at around 30 rps.

raman-m commented 5 days ago

Thank you for conducting the tests! Is the official hotfix release package, version 23.3.4, needed to update your all environments and conduct the final bugfix test? Can you update environments with manually built DLLs?

Additionally, I'm going to review the code once more... Having unit/integration tests is likely sufficient to deem the code acceptable.

antikorol commented 5 days ago

Thank you for conducting the tests! Is the official hotfix release package, version 23.3.4, needed to update your all environments and conduct the final bugfix test? Can you update environments with manually built DLLs?

At your discretion, I can release it with my fix for now and remove it in the future.

raman-m commented 5 days ago

Great, Roman! I'd like to merge it into the develop branch, after which you can create a temporary build from your forked repository and deploy it to the Prod environment, okay?

Could you provide more details about your Production environment? Please include information on the hosting server (Docker, IIS, self-hosted), the number of Ocelot instances, CPU & memory resources per Ocelot instance, requests per second, and the technical stack (specific Ocelot features utilized), etc.

Is the Ocelot gateway deployed on a public or private network? Also, could you specify the industry that the Prod product is part of?

antikorol commented 5 days ago

Is the Ocelot gateway deployed on a public or private network?

Unfortunately, the game is still in closed beta testing, and the loads on prod are much times lower than what we simulate during tests. Ocelot is deployed as a regular deployment in k8s. But If I find any issues, I will let you know :)

raman-m commented 5 days ago

Roman, thank you for addressing the code review issues in the commit https://github.com/ThreeMammals/Ocelot/pull/2111/commits/bbed7d1aa195f425939cc5f3a69e245f6bdd565e! I will conduct the final code review shortly and may push some commits... Your patience is appreciated.