Closed · antikorol closed this issue 1 month ago
Hello Roman! Welcome to Ocelot world! 🐯
It's difficult to determine what happened. Could you please upload the entire solution to GitHub for review? Alternatively, you could attach the following artifacts to this thread:
- the `serviceName` configuration
- `appsettings.json`

This information will likely assist in identifying the issue.
> After upgrade to 23.3.3 from 23.2.2, about every 20th-30th request finished with status code 500
You didn't observe the same errors in the Ocelot logs for version 23.2.x, did you? Are you now operating in two different environments, or did you only upgrade a single environment?
Since you're utilizing Kubernetes, could you provide more details about your K8s server deployment? How do you initiate the server, is it locally or in a cloud remotely? And how long have you been using Kubernetes in conjunction with Ocelot service discovery?
Regarding the exception details...
```
requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId,
message: 'Exception caught in global error handler,
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)
```
Are you a C# developer? Can you debug C# code? Do you use the official Ocelot NuGet package?
Deploying the Release version of the Ocelot package DLLs means there is no debug information for the called lines in the Call Stack trace. However, we can analyze the code of the `RoundRobin.Lease` method:
https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L19-L38
It appears the issue may lie here:
https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L32-L33
There's a missing null check for line 33. Theoretically, the service provider could return a null object, which should be accounted for.
To conclude, it seems that for some reason, the Kube service provider and its service builder returned a null list item, leading to the `NullReferenceException` being raised.
Therefore, further investigation into the problem is warranted...
@antikorol The subsequent step in the investigation involves searching for subsequent messages in Ocelot's log. Here is the service builder logic: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/KubeServiceBuilder.cs#L22-L33
We need to search for the following message in the log: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/KubeServiceBuilder.cs#L31
In the `appsettings.json` file, the `Debug` logging level for the application can be defined as follows:
```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}
```
Deploy and review the log for the message:
If this message appears in the log, it indicates that the Kubernetes setup is correct, and we can proceed to investigate the issue more thoroughly.
> every 20th - 30th request finished with status code 500
It seems that the requests to the Kubernetes endpoint were failing around the 20th to 30th attempts, as indicated by the code in these lines: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/Kube.cs#L34-L36
This suggests that the Kubernetes client connectivity might be unstable. It would be advisable to switch from the `Kube` to the `PollKube` provider and observe the behavior again. The error should no longer occur if this change is made. Additionally, the `PollingInterval` should be set to less than half the average duration between occurrences of the 500 error.
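For reference, a `PollKube` setup in the global configuration might look like this — a sketch following Ocelot's Service Discovery docs, where the namespace and interval values are illustrative and `PollingInterval` is in milliseconds:

```json
{
  "GlobalConfiguration": {
    "ServiceDiscoveryProvider": {
      "Namespace": "default",
      "Type": "PollKube",
      "PollingInterval": 100
    }
  }
}
```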
Let's figure out what's happening and possibly provide a hotfix for your user scenario to eliminate the Null Reference Error logging. We need to handle this 500 status case and process it accordingly. Will you be opening a PR soon?
1) I am using the official NuGet package. There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port; now I have started using your logic with scheme matching via the named port.
2) I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces. https://github.com/antikorol/sample-of-error
3) When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error. (I replaced it this way because I wanted to see in which line the null reference exception appears)
4) When I use PollKube, I get an error in the logs "requestId: 0HN4ISPBSIV51:000023E9, previousRequestId: No PreviousRequestId, message: 'Error Code: ServicesAreEmptyError Message: There were no services in RoundRobin during Lease operation. errors found in ResponderMiddleware"
5) When I changed log level to debug, I was only able to reproduce this exception once
Below is the log with debug information.
| Jun 24, 2024 @ 15:13:45.018 | requestId: 0HN4K8QDBB2FT:00000016, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.565 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'Exception caught in global error handler, exception message: Object reference not set to an instance of an object., exception stack: at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext) at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext) at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext) at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext) at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext) at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext) at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext) at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext) at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext) at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext) at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext) at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext) at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext) at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext) at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext) at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext) at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4K8QDBB2FO:00000015'
| Jun 24, 2024 @ 15:13:39.515 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.511 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'
| Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
| Jun 24, 2024 @ 15:13:39.209 | requestId: 0HN4K8QDBB2FR:00000014, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
Dear Roman, what's your full name? What's your LinkedIn? We can continue the discussion in a LinkedIn chat...
> 1) I am using the official NuGet package. There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port
Please be aware that any custom logic may compromise stability, and the Ocelot team cannot be held accountable for it. It appears your existing custom logic is only compatible with version 23.2. I guess you might be facing a Kubernetes misconfiguration issue related to the Downstream Scheme versus Port Names feature. Are you utilizing this feature?
It would be helpful to examine the service definition in your Kubernetes configuration. Could you provide it, please?
No, I removed that logic because your changes already cover the case I needed with multiple ports.
> 2) I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces. https://github.com/antikorol/sample-of-error
Good! After a quick review:
- Where is the `ocelot.json` file? In other words, where is the primary configuration file for Ocelot? In `appsettings.json`? The Ocelot configuration should be in `ocelot.json`! For configuration guidance, see the Configuration docs.
- I'm surprised that your Ocelot app started at all! Did you develop the Ocelot app in the past using version 23.2, or did someone else?

Did you read the Configuration docs? Especially the Multiple Environments paragraph, which says how to configure the app with `.AddJsonFile("ocelot.json") // primary config file`.
So, I don't see the configuration part in these lines L27-L42 ❗ You have to set up configuration:
```csharp
builder.WebHost
    .ConfigureAppConfiguration((context, config) =>
    {
        var env = context.HostingEnvironment;
        config.SetBasePath(env.ContentRootPath)
            .AddJsonFile("appsettings.json", true, true)
            .AddJsonFile($"appsettings.{env.EnvironmentName}.json", true, true)
            .AddOcelot(env) // happy path
            .AddEnvironmentVariables();
    })
```
So, it's very important to configure via the `AddOcelot` method:

```csharp
.ConfigureAppConfiguration((context, config) => config.AddOcelot())
```
This is the minimal required configuration code! But it is better to use the official template from the docs.
@antikorol commented:
> No, I removed that logic because your changes already cover the case I needed with multiple ports.
Good! You want to use the feature: Downstream Scheme vs Port Names.
Also, you enabled the port finder via `"DownstreamScheme": "http"`. Now, are you sure your K8s definition is correct? Should we review the K8s config together?
There was an update from 18.0.0 to 23.2.2 a long time ago, and now to 23.3.3. I had the configuration in a YAML file, but to exclude all possible libraries, I converted it to JSON and added it to the appsettings; and yes, everything worked. Now I've moved it to a separate ocelot.json and fixed the registration.
According to the Microsoft documentation, it is not mandatory to explicitly specify appsettings.json or its variations depending on the environment. https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-6.0
> Should we review K8s config together?
Yes, what should I provide?
> 3) When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error.
I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same. I don't understand what you're referring to.
> (I replaced it this way because I wanted to see in which line the null reference exception appears)
I apologize, but uncommenting line L53 does not result in any replacement. You need to implement actual service injection using the official helpers: Custom Load Balancers. Have you consulted this documentation?
Even if you substitute the services with custom ones, it won't aid in identifying the failing line, as this necessitates the inclusion of Debug version DLLs in your Docker file, as seen here: L21. Therefore, you must build with the debug configuration, `-c Debug`, to ensure the log contains Call Stack code lines within messages for exceptions.
@antikorol commented:
> yes, what I should provide?
Please provide the Kubernetes service definition in text, JSON format, or the Kubernetes client response JSON. From the log you've shared, it appears you have defined only one service; however, the number of ports in the definition is unclear.
Additionally, I'm curious as to why every twentieth Kubernetes client request fails or returns an invalid response, causing the Ocelot provider to throw a `NullReferenceException`. We need to pinpoint this faulty request and obtain the response data in JSON format to replicate the user scenario through tests.
Lastly, I trust that you have the expertise to debug the provider, identify the issue, and potentially write tests or even submit a pull request with a fix.
> I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same.
Sorry for the confusion. The files are identical. I didn't want to change the project build type. In the release build, I can see lines of my own code in the stack trace. I wanted to understand in which exact line the error occurred so that I could add logs later. Surprisingly, the error didn't reproduce when I used just a copy.
I am awaiting a PR that will resolve the issue... At present, I cannot replicate your user scenario as there are no reproduction steps provided. Writing these steps is challenging because everything appears to function correctly for the first 29 requests, and only the 30th request fails with an exception. This is an integration issue with invalid data from the integrated service, which is the K8s client. Since I cannot reproduce the issue and lack access to your actual production environment, emulating your production setting is also difficult. Please pinpoint the issue and return to us with a bugfix PR that better illustrates the problem. Best of luck!
@raman-m
I still haven't found a 100% reproduction case, but I delved deeper into the code and noticed one detail.
https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/Kube.cs#L38
This class is registered as a singleton for each `ServiceProviderConfiguration`, and the class that resolves it, `RoundRobinCreator`, is also registered as a singleton. At the same time, we use the `Kube.GetAsync` method call in async code, where we can easily encounter a "Collection was modified" exception. Maybe somewhere deep in `List.Clear` or elsewhere, due to resizing or clearing the array, we might lose references and encounter an "Object reference not set to an instance..." exception.
What do you think about this part of the code?
> This class is registered as a singleton for each `ServiceProviderConfiguration`, and the class that resolves it
Look at the actual registration method:
https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/OcelotBuilderExtensions.cs#L9-L17
So, the `Kube` class is not registered at all. Only the `KubernetesProviderFactory.Get` delegate is registered:
https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/OcelotBuilderExtensions.cs#L13
`Kube` is instantiated on every request. The logic is quite complicated. And I think `Kube` is a scoped service.
Also, `IKubeApiClient` is a scoped registered service. The problem could be a bad or empty response from `IKubeApiClient`.
> `RoundRobinCreator` is also registered as a singleton.
The issue seems to be with the load balancing logic, which involves a singleton `RoundRobin` service that is poorly synchronized. The `RoundRobin` is receiving an unsynchronized `IServiceDiscoveryProvider` object. Furthermore, the `Lease` method invokes an asynchronous method that may return lists of varying lengths, which I believe is problematic. Additionally, the exception call stack pinpoints the `Lease` method as the source of exceptions. It would be prudent to rewrite the `Lease` method to operate synchronously and stabilize the list length immediately.
What's happening, in my opinion:
1. The 1st thread gets the list of 1 item (count == 1).
2. The 2nd thread gets a list of 1 null item (count == 1).
3. The 1st thread goes inside the if-block, considering the list is not empty.
4. The 2nd thread has finished the async call `await _servicesDelegate?.Invoke()`, which returns a list of null items.
5. The 2nd thread re-initializes the `services` variable with the list of null items.
6. The 1st thread goes to line 32 and reads a null object.
7. The 1st thread reads data on line 33 from the null object, and the exception is generated.
I believe the problem is on line 33: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L33 It must be rewritten!
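The line in question dereferences the leased item without a null check. Below is a minimal, self-contained sketch of the defensive shape being proposed; the `Service`/`HostAndPort` records and the `Balancer` class are stand-ins for illustration, not Ocelot's real types:

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Stand-in models (not the real Ocelot.Values types).
record HostAndPort(string Host, int Port);
record Service(HostAndPort HostAndPort);

class Balancer
{
    private int _last;

    // Returns the next host, or null instead of throwing, even when another
    // thread left a null slot in the list between the Count check and the read.
    public HostAndPort? Lease(List<Service?> services)
    {
        if (services is not { Count: > 0 }) return null;
        if (_last >= services.Count) _last = 0;
        var next = services[_last]; // may be null under the race being discussed
        _last++;
        return next?.HostAndPort;   // null-conditional guard instead of an NRE
    }
}

class Program
{
    static void Main()
    {
        var balancer = new Balancer();
        var services = new List<Service?> { new(new("10.0.0.1", 80)), null };
        Console.WriteLine(balancer.Lease(services)?.Host ?? "no service"); // 10.0.0.1
        Console.WriteLine(balancer.Lease(services)?.Host ?? "no service"); // no service
    }
}
```

With a guard along these lines, a torn read from the provider degrades into a "no services" error response rather than an unhandled `NullReferenceException`.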
You are right that the problem is at this row. But the main problem is the concurrency issue that I described before in the `Kube` class, because the collection of services is declared as a class field: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/Kube.cs#L29
I spent a lot of time confirming the issue; I hope it will be enough. I added debug logs and can see that after clearing the list, sometimes after inserting one service, the list ends up with a size of 2.
The second element of the list is null; I can see it in the `RoundRobin` class.
Example of logs from the `Kube` class, where I can see that the captured list and the `_services` list are different:
```csharp
public virtual async Task<List<Ocelot.Values.Service>> GetAsync()
{
    var endpoint = await _kubeApi
        .ResourceClient(client => new EndPointClientV1(client))
        .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);
    _services.Clear();
    if (endpoint?.Subsets.Count != 0)
    {
        var array = BuildServices(_configuration, endpoint!);
        _services.AddRange(array);
        var captured = new List<Ocelot.Values.Service>(array);
        _logger.LogWarning(() => $"RoundRobin2: Kube.GetAsync;"
            + $"_services.Count: {_services.Count}; captured.Count: {captured.Count};"
            + $" Services: {string.Join(Environment.NewLine, captured.Select(s => s is null ? "null" : JsonConvert.SerializeObject(s)))}");
    }
    else
    {
        _logger.LogWarning(() => $"K8s Namespace:{_configuration.KubeNamespace}, Service:{_configuration.KeyOfServiceInK8s}; Unable to use: it is invalid. Address must contain host only e.g. localhost and port must be greater than 0!");
    }
    return _services;
}
```
If the logs aren't entirely clear, I created a code sample that shows that if the list is modified from different threads, we can get null where it shouldn't be. In the sample, I always add one element to the list, but in different threads, and I end up with the list size being more than 1 and the second element being null. And we don't get an exception that the collection was modified. The example is here: https://github.com/antikorol/incorrect-list-usage-sample
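The failure mode the sample demonstrates can be sketched in a few lines. This is an illustrative repro of unsynchronized `List<T>` access, not code from the Ocelot provider; because the races are timing-dependent, the anomaly count varies per run, so no exact output is claimed:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// A shared List<T> cleared and refilled concurrently without a lock.
// List<T> is not thread-safe, so readers can observe impossible states:
// a wrong Count, null slots, or "Collection was modified" exceptions.
class Program
{
    static readonly List<string> Shared = new();

    static void Main()
    {
        int anomalies = 0;
        Parallel.For(0, 10_000, _ =>
        {
            try
            {
                // Each iteration mimics Kube.GetAsync: clear, then add one item.
                Shared.Clear();
                Shared.Add("service");
                var snapshot = Shared.ToArray(); // unsynchronized read
                // With correct synchronization this would always be exactly
                // one non-null item; anything else is a torn state.
                if (snapshot.Length != 1 || snapshot[0] is null)
                    anomalies++;
            }
            catch (Exception) // e.g. InvalidOperationException from a torn mutation
            {
                anomalies++;
            }
        });
        Console.WriteLine($"Anomalous observations: {anomalies}");
    }
}
```

On a multi-core machine this typically reports a nonzero anomaly count, which matches the "list of size 2 with a null second element" observation above.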
@antikorol Roman,
This code analysis and research are excellent, thank you!
I concur that the `Kube` design warrants a review, and synchronization issues need resolution. It appears that the forthcoming refactoring may alter between 2 and 5 classes in the K8s provider. Following your report, it seems the `Kube` instance is acting as a singleton, likely due to the storage of provider delegates in a static field.
If you open a fix PR, I will prepare an additional hotfix release. However, I'm concerned that we need to refactor the `Kube` provider logic, which may require some time to develop smoothly.
FYI, I've included this bug issue in the Summer'24 milestone. This represents the least favorable delivery scenario. I am optimistic that we will manage to deliver the fix within a week or two; alternatively, we can assign the issue to the Annual'23 release, which is the early-bird aka current release.
A quick fix is to move the creation of the list from the class fields into the method. This will increase allocations, but it will help avoid exceptions in high-load APIs. A memory-optimized solution can be written or refactored later.
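The suggested quick fix might look roughly like this — an illustrative sketch with stand-in names (`DiscoveryProvider`, `FetchAddressesAsync`), not the actual committed change:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Instead of clearing and refilling a shared field, each call builds and
// returns its own list, so concurrent callers can never observe each
// other's partial writes. Names are illustrative, not Ocelot's.
class DiscoveryProvider
{
    // Stand-in for the data fetched from the Kubernetes endpoint.
    private static Task<string[]> FetchAddressesAsync() =>
        Task.FromResult(new[] { "10.0.0.1:80" });

    // Before: `_services.Clear(); _services.AddRange(...); return _services;`
    // After: a fresh local list per call.
    public async Task<List<string>> GetAsync()
    {
        var addresses = await FetchAddressesAsync();
        return addresses.ToList(); // local list; nothing shared to race on
    }
}

class Program
{
    static void Main()
    {
        var provider = new DiscoveryProvider();
        // Two concurrent callers get independent lists.
        var a = provider.GetAsync().Result;
        var b = provider.GetAsync().Result;
        Console.WriteLine(ReferenceEquals(a, b)); // False
        Console.WriteLine(a.Count);               // 1
    }
}
```

The trade-off is exactly the one stated above: one extra list allocation per discovery call, in exchange for removing the `Clear`/`AddRange` race entirely.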
Would you be interested in contributing by submitting a PR? If so, please fork the repository into your account.
Hello, Raman @raman-m I've opened the PR
Thank you, Roman! To understand the subsequent steps, please review our Dev Process.
Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful:
```csharp
var endpoint = await _kubeApi
    .ResourceClient(client => new EndPointClientV1(client))
    .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);
```
> Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful:
> `var endpoint = await _kubeApi.ResourceClient(client => new EndPointClientV1(client)).GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);`
```json
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "fake-service",
    "namespace": "fake-namespace-dev",
    "uid": "a00000a0-a0aa-0000-0000-0a00aaa0a00a",
    "resourceVersion": "111111",
    "creationTimestamp": "2023-11-01T13:07:32Z",
    "labels": {
      "app.kubernetes.io/instance": "fake-service",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "fake-service",
      "app.kubernetes.io/version": "1.0.1",
      "helm.sh/chart": "fake-service-0.1.0"
    },
    "managedFields": [
      {
        "manager": "kube-controller-manager",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2024-07-11T11:04:39Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:labels": {
              ".": {},
              "f:app.kubernetes.io/instance": {},
              "f:app.kubernetes.io/managed-by": {},
              "f:app.kubernetes.io/name": {},
              "f:app.kubernetes.io/version": {},
              "f:helm.sh/chart": {}
            }
          },
          "f:subsets": {}
        }
      }
    ]
  },
  "subsets": [
    {
      "addresses": [
        {
          "ip": "111.22.33.10",
          "nodeName": "ip-111.22.33.10.eu-central-1.compute.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "fake-namespace-dev",
            "name": "fake-service-ddsadasd-1111",
            "uid": "a00000a0-a0aa-0000-0000-0a00aaa0a00a"
          }
        },
        {
          "ip": "111.33.33.11",
          "nodeName": "ip-111.33.33.11.eu-central-1.compute.internal",
          "targetRef": {
            "kind": "Pod",
            "namespace": "fake-namespace-dev",
            "name": "fake-service-ddsadasd-2222",
            "uid": "a00000a0-a0aa-0000-0000-0a00aaa0a00a"
          }
        }
      ],
      "ports": [
        {
          "name": "http",
          "port": 80,
          "protocol": "TCP"
        }
      ]
    }
  ]
}
```
Thank you for the JSON samples! I will utilize them to write an acceptance test.
Roman, I have the next question regarding load balancing in your project... Why does your team use Kubernetes for service registrations? Is the project big enough? Using orchestration tools like Kubernetes, or a discovery service like Consul, makes sense for large projects deployed in a cloud.
If the project is small and the number of downstream services is constant (with all service IPs known), you can use static service definitions in the `DownstreamHostAndPorts` route options and manually load balance, as described in this documentation section:
So, it's just a question of...
In my opinion, Kubernetes is a good choice for a startup/initial stage of the project as a simple and quick way to implement service discovery (it can be a five-minute job). In the future, after analyzing the load, you can implement another solution.
Using a static IP for a pod, in my view, is an anti-pattern. Using pod IPs in static configuration is a bad idea because, in Kubernetes, there are many cases where the IP address will change: deployment, scaling, and even simply rescheduling of the pod to a different node by Kubernetes itself.
At the same time, to reduce the load on Kubernetes, I can simply use the PollKube. However, unfortunately, this bug affects this provider too, but this fix resolves it. :)
Is the project big enough?
I hope it will be huge in the future :D, but even now there is scaling during peak load hours.
@antikorol Hi Roman! I was able to reproduce the problem in this branch: raman-m/2110-steps-to-reproduce → the KubernetesServiceDiscoveryTests class. You need to run ShouldReturnServicesFromK8s_HavingHighLoadOnTheProviderAndRoundRobinBalancer tests.
Be aware that there are stable and unstable theory cases; however, both exhibit the 500 status issue. To reveal the 500 error in the green test, you must use the "Run Until Failed" option. Additionally, both cases, when in Debug mode, might display the exception under discussion, because Debug mode shows exceptions being thrown. I'm going to reuse my research results in the fixing PR...
@raman-m Hello Raman, thank you for the acceptance tests. I've tried to run this test, but I get the following error (on macOS):
`System.Net.Sockets.SocketException: Can't assign requested address`
This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement them on my work laptop.
I see that your CI failed on this test case. Can I simply cherry-pick the commit and check it on your CI?
We use Windows and Linux only for now; don't use macOS, please. CircleCI runs builds using a Linux env. Now the test is green.
> I've tried to run this test, but I have the error (on macOS)
> `System.Net.Sockets.SocketException: Can't assign requested address`
> This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement on my work laptop.
@antikorol, you've highlighted an important aspect concerning macOS builds. We will soon establish automated Windows CI-CD builds, as certain features of Ocelot necessitate it. Linux builds satisfy 99.9% of our requirements, and the .NET SDK on Linux is highly stable, since Microsoft conducts extensive testing of the .NET SDK on Linux OS. I am in the process of ascertaining whether the issue lies with the .NET SDK on macOS or whether it is an issue unique to Ocelot.
@raman-m Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias.
If I don't use loopback here, the tests pass (the method is called with different ports).
Understood! As a macOS user, you can suggest a preprocessor-directive correction to ensure this helper runs successfully in macOS environments. macOS machines should indeed have `localhost` as an embedded local IP. Have you investigated this issue thoroughly? Could you run the Terminal command `ping localhost` or an equivalent and share the Terminal output, please? Unfortunately, I do not have access to any macOS machines.
> Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias.
You can manually configure aliases such as `localhost1` and `localhost2` in macOS; however, this method is not suitable for automated scripts.
I encountered this problem many years ago, and it seems that nothing has changed since then. I've attached a screenshot of the ping results.
I solved this problem last time by using the following command for each needed IP:

```shell
sudo ifconfig lo0 alias 127.0.0.2 up
```
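One way to script this for several addresses is to generate the commands first, so they can be reviewed before being piped to a privileged shell; the address range below is illustrative:

```shell
#!/bin/sh
# Emit one `ifconfig lo0 alias` command per loopback address in the range.
# Review the output, then run it with elevated privileges, e.g.:
#   ./aliases.sh | sudo sh
for i in $(seq 2 10); do
  echo "ifconfig lo0 alias 127.0.0.$i up"
done
```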
There might be a more elegant solution, but I didn't have free time to investigate further.
@antikorol commented on July 11, 2024
Hello! I have a few questions regarding the response JSON from Kubernetes when a service instance goes offline:
Links to related Kubernetes documentation would be appreciated.
@antikorol commented on July 11, 2024
Hello. Kubernetes docs: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#read-operations-endpoints-v1-core
When there are no available pod addresses, `subsets` are not returned:

```json
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "fake-service",
    "namespace": "fake-namespace-dev",
    "uid": "a00000a0-a0aa-0000-0000-0a00aaa0a00a",
    "resourceVersion": "111111",
    "creationTimestamp": "2023-11-01T13:01:12Z",
    "labels": {
      "app.kubernetes.io/instance": "fake-service",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "fake-service",
      "app.kubernetes.io/version": "1.0.0",
      "helm.sh/chart": "fake-service-0.1.0"
    },
    "managedFields": [
      {
        "manager": "kube-controller-manager",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2024-07-30T12:31:35Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:labels": {
              ".": {},
              "f:app.kubernetes.io/instance": {},
              "f:app.kubernetes.io/managed-by": {},
              "f:app.kubernetes.io/name": {},
              "f:app.kubernetes.io/version": {},
              "f:helm.sh/chart": {}
            }
          }
        }
      }
    ]
  }
}
```
If the service has been deleted, a status code 404 is returned with the following response:

```json
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "endpoints \"fake-service\" not found",
  "reason": "NotFound",
  "details": {
    "name": "fake-service",
    "kind": "endpoints"
  },
  "code": 404
}
```
If the service has not been deleted but has gone offline, Kubernetes clients should still return one or more service instances as addresses in subsets. If there were two instances and one went offline without being deleted, the question is: What happens to the list of services in the JSON response? Will it still be present or marked with a special tag, annotation, etc.?
From the details of your issue and the logs provided, it appears there was one instance online. For some reason, this instance went offline, resulting in an empty JSON addresses/subsets list, causing Ocelot's `Kube` provider to behave incorrectly at the load balancer level.
If the Kubernetes endpoint removes offline instances from the list, then we have one testing scenario.
If the Kubernetes endpoint does not remove offline instances but marks them with tags/metadata, then we have a different testing scenario.
Currently, I am testing the first scenario, assuming that offline services are removed from the list in the subset. I hope this reflects the real-world behavior of the Kubernetes endpoint. However, I understand that DevOps can configure orchestration differently.
A service cannot be offline in Kubernetes; it may simply have a pod count of 0. Only pods in the 'Running' state are included in the subsets list. Even when a pod starts to delete (state 'Terminating'), it is no longer considered available (it will not be present in the subsets list). Maybe I just misunderstood you.
From the details of your issue and the logs provided, it appears there was one instance online
I provided information on the service that has the fewest pods, to reduce logs. The error is also reproduced on services with more than 10 pods.
⚠️ As I mentioned earlier, somewhere at the beginning of our discussion, the main issue here is modifying the list from different threads (clearing and adding). It doesn't depend on the number of elements, whether it's 1 or 5. If you look at the example I shared earlier, you'll see a lot of interesting behaviors when modifying from different threads. Specifically, for an example with 5 services:
Here is an example with results (five valid elements are always added to the list!):
The code committed to Git contains commented-out code for serialization and copying of the list, as enabling it significantly increases the number of exceptions related to the collection being modified (accessing an element by index does not throw such an exception, which often leads to cases with null elements).
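One race-free alternative to clearing and re-filling a shared list is to publish an immutable snapshot and swap the reference atomically, so readers never observe a half-cleared collection. A Python sketch of this idea (class and method names are illustrative, not Ocelot's):

```python
import threading

class ServiceSnapshot:
    """Holds the discovered services; writers publish a fresh copy,
    and readers always see a complete list (possibly slightly stale)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._services = []

    def update(self, new_services):
        snapshot = list(new_services)   # build the new list off to the side
        with self._lock:
            self._services = snapshot   # single atomic reference swap

    def get(self):
        with self._lock:
            return self._services       # never mutated after publishing
```

Because a published snapshot is never mutated, a reader that obtained the old list keeps a consistent view even while the poller publishes a new one.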
Hello Roman, the hotfix has been merged, and we are ready to begin the testing phase for this development version. As I understand it, your project has not yet moved to Production, but you do have a Staging/Testing environment. Could you please compile the development version manually, update the project reference, and begin testing with logging enabled at the Debug level? The official release of the hotfix will occur once the remaining tasks in the milestone are completed, which I anticipate will be by September '24. Please let me know the testing results from your environment.
⚠️ As I mentioned earlier, somewhere at the beginning of our discussion, the main issue here is modifying the list from different threads (clearing and adding). It doesn't depend on the number of elements, whether it's 1 or 5. If you look at the example I shared earlier, you'll see a lot of interesting behaviors when modifying from different threads. Specifically, for an example with 5 services:
While running acceptance tests in a local testing environment, I observed an increased likelihood of encountering 404/500 status errors as more services went offline. This issue may stem from the heavy load on the IKubeApiClient integration. I managed to stabilize the interaction between RoundRobin and Kube by implementing a Retry pattern within the Kube provider, which appears to have eliminated the 404 status errors when RoundRobin fails after receiving an empty list of services. However, Ocelot sometimes returns 500/502 status errors when the 'Run Until Failure' test command is executed in Visual Studio.
https://github.com/ThreeMammals/Ocelot/blob/19a8e2f8b3e773fbe962a56fc2e7067b6b19132b/test/Ocelot.AcceptanceTests/ServiceDiscovery/KubernetesServiceDiscoveryTests.cs#L149
The root cause remains unidentified, necessitating further testing in production environments, as my tests were only conducted locally. In conclusion, the Kube provider seems unstable under heavy loads, with an error ratio of approximately 1/500, suggesting that the issue may still lie within the IKubeApiClient integration.
Consider switching to the PollKube provider, which significantly reduces the load on the IKubeApiClient and the Kubernetes instance.
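The Retry pattern mentioned above can be sketched as follows. This is an illustration of the general idea only, not the actual Kube provider code; the function name, retry count, and delay are assumed values:

```python
import time

def get_endpoints_with_retry(fetch, retries=3, delay_seconds=0.0):
    """Call fetch() until it yields a non-empty endpoint list or retries run out.

    Transient empty responses from the discovery API are retried instead of
    being handed straight to the load balancer.
    """
    endpoints = []
    for attempt in range(retries):
        endpoints = fetch()
        if endpoints:                          # got usable addresses
            return endpoints
        if delay_seconds and attempt < retries - 1:
            time.sleep(delay_seconds)          # brief pause before retrying
    return endpoints                           # still empty: caller decides (404/503)
```

A fetch that transiently returns empty lists, then a valid address, succeeds within the retry budget; a permanently empty fetch still surfaces the empty result so the caller can respond appropriately.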
Expected Behavior / New Feature
Resolving a service address to route an API request
Actual Behavior / Motivation for New Feature
Sometimes, about every 20th-30th request finishes with status code 500.
Steps to Reproduce the Problem
Specifications