ThreeMammals / Ocelot

.NET API Gateway
https://www.nuget.org/packages/Ocelot
MIT License
8.34k stars 1.63k forks source link

After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500 #2110

Closed antikorol closed 1 month ago

antikorol commented 3 months ago

Expected Behavior / New Feature

Resolving a service address to route an API request

Actual Behavior / Motivation for New Feature

Sometimes, about every 20th - 30th request finished with status code 500

requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId, 
message: 'Exception caught in global error handler, 
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4ISPBSIV4V:000037FF'

Steps to Reproduce the Problem

{
  "Routes": [
    {
      "UpstreamPathTemplate": "/public/{everything}",
      "UpstreamHttpMethod": [
        "POST"
      ],
      "ServiceName": "serviceName",
      "DownstreamPathTemplate": "/public/{everything}",
      "DownstreamScheme": "http",
      "LoadBalancerOptions": {
        "Type": "RoundRobin"
      }
    }
  ],
  "GlobalConfiguration": {
    "ServiceDiscoveryProvider": {
      "Namespace": "namespace",
      "Type": "Kube"
    }
}
}

Specifications

raman-m commented 3 months ago

Hello Roman! Welcome to Ocelot world! 🐯

It's difficult to determine what happened. Could you please upload the entire solution to GitHub for review? Alternatively, you could attach the following artifacts to this thread:

This information will likely assist in identifying the issue.


After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500

You didn't observe the same errors in the Ocelot logs for version 23.2.x, did you? Are you now operating in two different environments, or did you only upgrade a single environment?

Since you're utilizing Kubernetes, could you provide more details about your K8s server deployment? How do you initiate the server, is it locally or in a cloud remotely? And how long have you been using Kubernetes in conjunction with Ocelot service discovery?

raman-m commented 3 months ago

Regarding the exception details...

requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId, 
message: 'Exception caught in global error handler, 
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)

Are you a C# developer? Can you debug C# code? Do you use the official Ocelot NuGet package?

Deploying the Release version of the Ocelot package DLLs means there is no debug information for the called lines in the Call Stack trace. However, we can analyze the code of the RoundRobin.Lease method. https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L19-L38

It appears the issue may lie here: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L32-L33 There's a missing null check for line 33. Theoretically, the service provider could return a null object, which should be accounted for.

To conclude, it seems that for some reason, the Kube service provider and its service builder returned a null list item, leading to the raising of NullReferenceException. Therefore, further investigation into the problem is warranted...

raman-m commented 3 months ago

@antikorol The subsequent step in the investigation involves searching for subsequent messages in Ocelot's log. Here is the service builder logic: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/KubeServiceBuilder.cs#L22-L33

We need to search for the following message in the log: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/KubeServiceBuilder.cs#L31

In appsettings.json file the Debug logging level for the application can be defined as follows:

{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}

Deploy and review the log for the message:

If this message appears in the log, it indicates that the Kubernetes setup is correct, and we can proceed to investigate the issue more thoroughly.

raman-m commented 3 months ago

every 20th - 30th request finished with status code 500

It seems that the requests to the Kubernetes endpoint were failing around the 20th to 30th attempts, as indicated by the code in these lines: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/Kube.cs#L34-L36 This suggests that the Kubernetes client connectivity might be unstable. It would be advisable to switch from the Kube to the PollKube provider and observe the behavior again. The error should no longer occur if this change is made. Additionally, the PollingInterval should be set to less than half the average duration between the occurrences of the 500 error.

Let's figure out what's happening and possibly provide a hotfix for your user scenario to eliminate the Null Reference Error logging. We need to handle this 500 status case and process it accordingly. Will you be opening a PR soon?

antikorol commented 3 months ago

1) I am using the official NuGet package. There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port

2) I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces. https://github.com/antikorol/sample-of-error

3) When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error. (I replaced it this way because I wanted to see in which line the null reference exception appears)

4) When I use PollKube, I get an error in the logs "requestId: 0HN4ISPBSIV51:000023E9, previousRequestId: No PreviousRequestId, message: 'Error Code: ServicesAreEmptyError Message: There were no services in RoundRobin during Lease operation. errors found in ResponderMiddleware"

5) When I changed log level to debug, I was only able to reproduce this exception once

Below is the log with debug information.


  | Jun 24, 2024 @ 15:13:45.018 | requestId: 0HN4K8QDBB2FT:00000016, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
-- | --

  | Jun 24, 2024 @ 15:13:39.565 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'Exception caught in global error handler, exception message: Object reference not set to an instance of an object., exception stack:    at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)    at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext)    at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext)    at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4K8QDBB2FO:00000015'

  | Jun 24, 2024 @ 15:13:39.515 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.511 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.209 | requestId: 0HN4K8QDBB2FR:00000014, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
raman-m commented 3 months ago

Dear Roman, What's your full name? What's your LinkedIn? We can continue discussion in LinkedIn chat...

antikorol commented 3 months ago

@raman-m https://www.linkedin.com/in/roman-shevchik/

raman-m commented 3 months ago

1) I am using the official NuGet package. There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port

Please be aware that any custom logic may compromise stability, and the Ocelot team cannot be held accountable for it. It appears your existing custom logic is only compatible with version 23.2. I guess, you might be facing a Kubernetes misconfiguration issue related to the Downstream Scheme versus Port Names feature. Are you utilizing this feature?

It would be helpful to examine the service definition in your Kubernetes configuration. Could you provide it, please?

antikorol commented 3 months ago

No, I removed that logic because your changes already cover the case I needed with multiple ports.

raman-m commented 3 months ago

2) I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces. https://github.com/antikorol/sample-of-error

Good! After a quick review:

I wonder that your Ocelot app started at all! Did you develop the Ocelot app in the past using version 23.2 or someone else?

Did you read Configuration docs? Especially, the Multiple Environments paragraph which say how to configure app with .AddJsonFile("ocelot.json") // primary config file

So, I don't see configuration part in these lines L27-L42You have to setup configuration:

    builder.WebHost
        .ConfigureAppConfiguration((context, config) =>
        {
            var env = context.HostingEnvironment;
            config.SetBasePath(env.ContentRootPath)
                .AddJsonFile("appsettings.json", true, true)
                .AddJsonFile($"appsettings.{env.EnvironmentName}.json", true, true)
                .AddOcelot(env) // happy path
                .AddEnvironmentVariables();
        })

So, it's very important to configure via AddOcelot method:

 .ConfigureAppConfiguration((context, config) => config.AddOcelot())

This is minimal required configuring code! But better to use official template from docs.

raman-m commented 3 months ago

@antikorol commented:

No, I removed that logic because your changes already cover the case I needed with multiple ports.

Good! You wanna use the feature: Downstream Scheme vs Port Names

Also you enabled the port finder via "DownstreamScheme": "http". Now, are you sure your K8s definition is correct? Should we review K8s config together?

antikorol commented 3 months ago

There was update from 18.0.0 to 23.2.2 a long time ago and now to 23.3.3 I had a configuration in a YAML file, but to exclude all possible libraries, I converted it to JSON and added it to the appsettings, and yes, everything worked. Now I've moved it to a separate ocelot.json and fixed registration

According to the Microsoft documentation, it is not mandatory to explicitly specify appsettings.json or its variations depending on the environment. https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-6.0

Should we review K8s config together?

yes, what I should provide?

raman-m commented 3 months ago

3) When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error.

I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same. I don't understand what you're referring to.

(I replaced it this way because I wanted to see in which line the null reference exception appears)

I apologize, but uncommenting line L53 does not result in any replacement. You need to implement actual service injection using the official helpers: Custom Load Balancers. Have you consulted this documentation?

Even if you substitute the services with custom ones, it won't aid in identifying the failing line, as this necessitates the inclusion of Debug version DLLs in your Docker file, as seen here: L21. Therefore, you must build with the debug configuration -c Debug to ensure the log contains Call Stack code lines within messages for exceptions.

raman-m commented 3 months ago

@antikorol commented:

yes, what I should provide?

Please provide the Kubernetes service definition in text, JSON format, or the Kubernetes client response JSON. From the log you've shared, it appears you have defined only one service; however, the number of ports in the definition is unclear.

Additionally, I'm curious as to why every twentieth Kubernetes client request fails or returns an invalid response, causing the Ocelot provider to throw NullReferenceException. We need to pinpoint this faulty request and obtain the response data in JSON format to replicate the user scenario through tests.

Lastly, I trust that you have the expertise to debug the provider, identify the issue, and potentially write tests or even submit a pull request with a fix.

antikorol commented 3 months ago

I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same.

Sorry for the confusion. The files are identical. I didn't want to change the project build type. In the release build, I can see lines in the stack trace in my code. I wanted to understand in which exact line the error occurred so that I could add logs later. Surprisingly, the error didn't reproduce when I used just a copy

raman-m commented 3 months ago

I am awaiting a PR that will resolve the issue... At present, I cannot replicate your user scenario as there are no reproduction steps provided. Writing these steps is challenging because everything appears to function correctly for the first 29 requests, and only the 30th request fails with an exception. This is integration issue with invalid data from integrated service which is K8s client. Since I cannot reproduce the issue and lack access to your actual production environment, emulating your production setting is also difficult. Please pinpoint the issue and return to us with a bugfix PR that better illustrates the problem. Best of luck!

antikorol commented 3 months ago

@raman-m I haven't still found 100% reproduction case, but I delved deeper into the code and noticed one detail. https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/Kube.cs#L38 This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it, RoundRobinCreator is also registered as a singleton. At the same time, we use the Kube.GetAsync method call in async code, where we can easily encounter a "Collection was modified" exception. Maybe somewhere deep in the List.Clear or elsewhere, due to resize or clear array, we might lost references and encounter an "Object reference not set to an instance..." exception. What do you think about this part of code?

raman-m commented 3 months ago

This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it

Look at the actual registration method https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/OcelotBuilderExtensions.cs#L9-L17 So, the Kube class is not registered at all. Only KubernetesProviderFactory.Get delegate is registered: https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/OcelotBuilderExtensions.cs#L13 Kube is instantiated on every request. The logic is quite complicated. And I think Kube is scoped service. Also, IKubeApiClient is scoped registered service. The problem could be bad/empty response from IKubeApiClient


RoundRobinCreator is also registered as a singleton.

The issue seems to be with the load balancing logic, which involves a singleton RoundRobin service that is poorly synchronized. The RoundRobin is receiving an unsynchronized IServiceDiscoveryProvider object. Furthermore, the Lease method invokes an asynchronous method that may return lists of varying lengths, which I believe is problematic. Additionally, the exception call stack pinpoints the Lease method as the source of exceptions. It would be prudent to rewrite the Lease method to operate synchronously and stabilize the list length immediately.

What's happening in my opinion. 1st thread gets the list of 1 item (count == 1) 2nd thread gets list of 1 null-item (count == 1) 1st thread goes inside of the if-block considering the list is not empty 2nd thread has finished calling async call of await _servicesDelegate?.Invoke() which returns list of null items 2nd thread re-initializes services variable with list of null items 1st thread goes to the line 32 and reads null-object 1st thread read data on line 33 from null object and the exception is generated

I believe the problem is on the line 33 https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L33 It must be rewritten!

antikorol commented 3 months ago

https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs#L33

You are right that problem at this row. But the main problem in concurrent issue that I described before in the class Kube because the collection of services is declared at the class field level. https://github.com/ThreeMammals/Ocelot/blob/8c0180a6461984ebd11ceca5544ca9d0e1564975/src/Ocelot.Provider.Kubernetes/Kube.cs#L29

I spent a lot of time for confirming the issue, I hope it will be enough. I added debug logs and can see that after clearing the list, sometimes after inserting one service, the list becames with a size of 2.

image

The second element of list is null I can see it in class RoundRoubin

image

Example of logs: Kube class where I can see that captured list and _services list are different:

public virtual async Task<List<Ocelot.Values.Service>> GetAsync()
{
        var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);

        _services.Clear();
        if (endpoint?.Subsets.Count != 0)
        {
            var array = BuildServices(_configuration, endpoint!);

            _services.AddRange(array);

            var captured = new List<Ocelot.Values.Service>(array);

            _logger.LogWarning(() => $"RoundRobin2: Kube.GetAsync;"
                + $"_services.Count: {_services.Count}; captured.Count: {captured.Count};"
                + $" Services: {string.Join(Environment.NewLine, captured.Select(s => s is null ? "null" : JsonConvert.SerializeObject(s)))}");
        }
        else
        {
            _logger.LogWarning(() => $"K8s Namespace:{_configuration.KubeNamespace}, Service:{_configuration.KeyOfServiceInK8s}; Unable to use: it is invalid. Address must contain host only e.g. localhost and port must be greater than 0!");
        }

        return _services;
}

If the logs aren't entirely clear, I created a code sample that shows if the list is modified from different threads, we can get null where it shouldn't be. In the sample, I always add one element to the list, but in different threads, and I end up with the list size being more than 1 and the second element is null. And we don't get an exeption that the collection was modified. Example is here https://github.com/antikorol/incorrect-list-usage-sample

raman-m commented 3 months ago

@antikorol Roman, This code analysis and research are excellent, thank you! I concur that the Kube design warrants a review, and synchronization issues need resolution. It appears that the forthcoming refactoring may alter between 2 to 5 classes in the K8s provider. Following your report, it seems the Kube instance is acting as a singleton, likely due to the storage of provider delegates in a static field.

If you open a fix PR, I will prepare an additional hotfix release. However, I'm concerned that we need to refactor the Kube provider logic, which may require some time to develop smoothly.

FYI, I've included this bug issue in the Summer'24 milestone. This represents the least favorable delivery scenario. I am optimistic that we will manage to deliver the fix within a week or two, or alternatively, we can assign the issue to the Annual'23 release which is early bird aka current release.

raman-m commented 3 months ago

Tasks

antikorol commented 3 months ago

A quick fix is to move the creation of the List from the class fields to a method. This will increase allocations, but it will help avoid exception in high-load APIs. A memory-optimized solution can be written or refactored later.

raman-m commented 3 months ago

Would you be interested in contributing by submitting a PR? If so, please fork the repository into your account.

antikorol commented 3 months ago

Hello, Raman @raman-m I've opened the PR

raman-m commented 3 months ago

Thank you, Roman! To understand the subsequent steps, please review our Dev Process.

raman-m commented 2 months ago

Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful:

     var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);
antikorol commented 2 months ago

Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful:

     var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);
{
   "kind":"Endpoints",
   "apiVersion":"v1",
   "metadata":{
      "name":"fake-service",
      "namespace":"fake-namespace-dev",
      "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a",
      "resourceVersion":"111111",
      "creationTimestamp":"2023-11-01T13:07:32Z",
      "labels":{
         "app.kubernetes.io/instance":"fake-service",
         "app.kubernetes.io/managed-by":"Helm",
         "app.kubernetes.io/name":"fake-service",
         "app.kubernetes.io/version":"1.0.1",
         "helm.sh/chart":"fake-service-0.1.0"
      },
      "managedFields":[
         {
            "manager":"kube-controller-manager",
            "operation":"Update",
            "apiVersion":"v1",
            "time":"2024-07-11T11:04:39Z",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
               "f:metadata":{
                  "f:labels":{
                     ".":{

                     },
                     "f:app.kubernetes.io/instance":{

                     },
                     "f:app.kubernetes.io/managed-by":{

                     },
                     "f:app.kubernetes.io/name":{

                     },
                     "f:app.kubernetes.io/version":{

                     },
                     "f:helm.sh/chart":{

                     }
                  }
               },
               "f:subsets":{

               }
            }
         }
      ]
   },
   "subsets":[
      {
         "addresses":[
            {
               "ip":"111.22.33.10",
               "nodeName":"ip-111.22.33.10.eu-central-1.compute.internal",
               "targetRef":{
                  "kind":"Pod",
                  "namespace":"fake-namespace-dev",
                  "name":"fake-service-ddsadasd-1111",
                  "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a"
               }
            },
            {
               "ip":"111.33.33.11",
               "nodeName":"ip-111.33.33.11.eu-central-1.compute.internal",
               "targetRef":{
                  "kind":"Pod",
                  "namespace":"fake-namespace-dev",
                  "name":"fake-service-ddsadasd-2222",
                  "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a"
               }
            }
         ],
         "ports":[
            {
               "name":"http",
               "port":80,
               "protocol":"TCP"
            }
         ]
      }
   ]
}
raman-m commented 2 months ago

Thank you for the JSON samples! I will utilize them to write an acceptance test.

raman-m commented 2 months ago

Roman, I have the next question regarding load balancing in your project... Why does your team use Kubernetes for service registrations? Is the project big enough? Using containerization tools makes sense in case of large projects, in a cloud to deploy containerization tools like Kubernetes or Consul.

If the project is small and the number of downstream services is constant (with all service IPs known), you can use static service definitions in the DownstreamHostAndPorts route options and manually load balance, as described in this documentation section:

So, it's just a question of...

antikorol commented 2 months ago

In my opinion, Kubernetes is a good choice for a startup/initial stage of the project as a simple and quick way to implement service discovery (it can be a five-minute job). In the future, after analyzing the load, you can implement another solution.

Using a static IP for a pod, in my view, is an anti-pattern. Using pods IPs in static configuration is a bad idea because, in Kubernetes, there are many ways when IP address will be changed: deployment, scaling, and even simply rescheduling the pod to a different node by Kubernetes itself.

At the same time, to reduce the load on Kubernetes, I can simply use the PollKube. However, unfortunately, this bug affects this provider too, but this fix resolves it. :)

Is the project big enough?

I hope it will be huge in the future :D, but even now there is scaling during peak load hours.

raman-m commented 2 months ago

@antikorol Hi Roman! I was able to reproduce the problem in this branch: raman-m/2110-steps-to-reproduce → the KubernetesServiceDiscoveryTests class. You need to run ShouldReturnServicesFromK8s_HavingHighLoadOnTheProviderAndRoundRobinBalancer tests.

Be aware that there are stable and unstable theory cases; however, both exhibit the 500 status issue. To reveal the 500 error in the green test, you must use the "Run Until Failed" option. Additionally, both cases, when in Debug mode, might display the exception under discussion, because debug mode show exceptions being thrown. I'm going to reuse my research results in the fixing PR...

antikorol commented 2 months ago

@raman-m Hello Raman Thank you for acceptance tests I've tried to run this test, but I have the error (on macOS)

System.Net.Sockets.SocketException: Can't assign requested address

This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement on my work laptop.

I see that your CI failed on this test case. Can I simply cherry-pick the commit and check it on your CI?

raman-m commented 2 months ago

We use Windows and Linux only for now. Don't use MacOS please. CircleCI runs builds using Linux env. Now the test is green

raman-m commented 2 months ago

I've tried to run this test, but I have the error (on macOS)

System.Net.Sockets.SocketException: Can't assign requested address

This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement on my work laptop.

@antikorol, you've highlighted an important aspect concerning MacOS builds. We will soon establish automated Windows CI-CD builds as certain features of Ocelot necessitate it. Linux builds satisfy 99.9% of our requirements, and the .NET SDK versus Linux is highly stable since Microsoft conducts extensive testing of the .NET SDK on Linux OS. I am in the process of ascertaining whether the issue lies with the .NET SDK on MacOS or if it is an issue unique to Ocelot?

antikorol commented 2 months ago

@raman-m Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias.

https://github.com/ThreeMammals/Ocelot/blob/raman-m/2110-steps-to-reproduce/test/Ocelot.AcceptanceTests/Steps.cs#L69

If I don't use loopback here, the tests pass (the method is called with different ports).

raman-m commented 2 months ago

Understood! As a MacOS user, you can suggest a directive preprocessor correction to ensure this helper runs successfully in MacOS environments. MacOS machines should indeed have localhost as an embedded local IP. Have you investigated this issue thoroughly? Could you run the Terminal command ping localhost or an equivalent and share the Terminal output, please? Unfortunately, I do not have access to any MacOS machines.

Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias.

You can manually configure aliases such as localhost1 and localhost2 in macOS; however, this method is not suitable for automated scripts.

antikorol commented 2 months ago

I encountered this problem many years ago, and it seems that nothing has changed since then. I've attached a screenshot of the ping results.

image

I solved this problem last time by using the command for each needed ips

sudo ifconfig lo0 alias 127.0.0.2 up

There might be a more elegant solution, but I didn't have free time to investigate further.

raman-m commented 2 months ago

@antikorol commented on July 11, 2024

Hello! I have a few questions regarding the response JSON from Kubernetes when a service instance goes offline:

Links to related Kubernetes documentation would be appreciated.

antikorol commented 2 months ago

@antikorol commented on July 11, 2024

Hello. Kubernetes docs: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#read-operations-endpoints-v1-core

{
  "kind": "Endpoints",
  "apiVersion": "v1",
   "metadata":{
    "name":"fake-service",
    "namespace":"fake-namespace-dev",
    "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a",
    "resourceVersion":"111111",
    "creationTimestamp": "2023-11-01T13:01:12Z",
    "labels": {
      "app.kubernetes.io/instance":"fake-service",
      "app.kubernetes.io/managed-by":"Helm",
      "app.kubernetes.io/name":"fake-service",
      "app.kubernetes.io/version":"1.0.0",
      "helm.sh/chart":"fake-service-0.1.0"
    },
    "managedFields": [
      {
        "manager": "kube-controller-manager",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2024-07-30T12:31:35Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:labels": {
              ".": {},
              "f:app.kubernetes.io/instance": {},
              "f:app.kubernetes.io/managed-by": {},
              "f:app.kubernetes.io/name": {},
              "f:app.kubernetes.io/version": {},
              "f:helm.sh/chart": {}
            }
          }
        }
      }
    ]
  }
}
raman-m commented 2 months ago

If the service has been deleted, return a status code 404 with the following response:

If the service has not been deleted but has gone offline, Kubernetes clients should still return one or more service instances as addresses in subsets. If there were two instances and one went offline without being deleted, the question is: What happens to the list of services in the JSON response? Will it still be present or marked with a special tag, annotation, etc.?

From the details of your issue and the logs provided, it appears there was one instance online. For some reason, this instance went offline, resulting in an empty JSON addresses/subsets list, causing Ocelot's Kube provider to behave incorrectly at the load balancer level. If the Kubernetes endpoint removes offline instances from the list, then we have one testing scenario. If the Kubernetes endpoint does not remove offline instances but marks them with tags/metadata, then we have a different testing scenario. Currently, I am testing the first scenario, assuming that offline services are removed from the list in the subset. I hope this reflects the real-world behavior of the Kubernetes endpoint. However, I understand that DevOps can configure orchestration differently.

antikorol commented 2 months ago

A service cannot be offline in Kubernetes; it may have a list of pods equal to 0. Only pods that are in the 'Running' state are included in the subset list. Even when a pod starts to delete (state 'Terminating'), it will no longer be considered available (will not present in subset list) Maybe I just misunderstood you.

From the details of your issue and the logs provided, it appears there was one instance online

I provided information on the service that has the fewest pods, to reduce logs. The error is also reproduced on services with more than 10 pods.

⚠️ As I mentioned earlier, somewhere at the beginning of our discussion, the main issue here is modifying the list from different threads (clearing and adding). It doesn't depend on the number of elements, whether it's 1 or 5. If you look at the example I shared earlier, you'll see a lot of interesting behaviors when modifying from different threads. Specifically, for an example with 5 services:

Here is example with results (Five valid elements are always added to the list!):

image

The code commited to Git contains commented-out code for serialization and copying of the list, as this significantly increases the number of exceptions related to the collection being modified (accessing an element by index does not return such an exception, which often leads to cases with null elements)

raman-m commented 1 month ago

Hello Roman, The hotfix has been merged, and we are ready to begin the testing phase for this development version. As I understand, your project has not yet been moved to Production, but you do have a Staging/Testing environment. Could you please compile the development version manually, update the project reference, and commence testing with logging enabled at the Debug level? The official release of the hotfix will occur once the remaining tasks in the milestone are completed, which I anticipate will be by September'24. So, let me know testing results in your env in the future.


⚠️ As I mentioned earlier, somewhere at the beginning of our discussion, the main issue here is modifying the list from different threads (clearing and adding). It doesn't depend on the number of elements, whether it's 1 or 5. If you look at the example I shared earlier, you'll see a lot of interesting behaviors when modifying from different threads. Specifically, for an example with 5 services:

While running acceptance tests in a local testing environment, I observed an increased likelihood of encountering 404/500 status errors as more services went offline. This issue may stem from the heavy load on the IKubeApiClient integration. I managed to stabilize the interaction between RoundRobin and Kube by implementing a Retry pattern within the Kube, which appears to have eliminated the 404 status errors when RoundRobin fails after receiving an empty list of services. However, Ocelot sometimes returns 500/502 status errors when the 'Run Until Failure' test command is executed in Visual Studio. https://github.com/ThreeMammals/Ocelot/blob/19a8e2f8b3e773fbe962a56fc2e7067b6b19132b/test/Ocelot.AcceptanceTests/ServiceDiscovery/KubernetesServiceDiscoveryTests.cs#L149 The root cause remains unidentified, necessitating further testing in production environments, as my tests were only conducted locally. In conclusion, the Kube provider seems unstable under heavy loads with an error ratio of approximately 1/500, suggesting that the issue may still lie within the IKubeApiClient integration.

P.S.

Consider switching to the PollKube provider, which reduces the load on the IKubeApiClient and the Kubernetes instance significantly.