ThreeMammals / Ocelot

.NET API Gateway
https://www.nuget.org/packages/Ocelot
MIT License

Operating memory research #1839

Closed: RaynaldM closed this issue 6 months ago

RaynaldM commented 9 months ago

We regularly hit a few Out Of Memory errors, which are very difficult to track down. We have found two probable causes. It's not much, but in a high-traffic context, small streams make big rivers.

Take a look at our PR to see the little fixes we've made.

But the hunt is not finished... 😃

raman-m commented 9 months ago

For a luckier hunt, I would say you could add more logging points... I agree, OOM problems cannot be detected easily right now because Ocelot has no memory-consumption monitor, indicators, or memory events...
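
As a concrete illustration of what such a logging point could look like, here is a minimal sketch of a middleware that logs a coarse memory snapshot when a request finishes; the class name and the use of LogDebug are hypothetical, and nothing like this exists in Ocelot today:

```csharp
// Hypothetical extra "logging point": not an Ocelot API, just a sketch of the idea.
// Logs a coarse memory snapshot when a request finishes, so OOMs can be correlated
// with traffic in the existing logs.
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

public class MemoryLoggingMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<MemoryLoggingMiddleware> _logger;

    public MemoryLoggingMiddleware(RequestDelegate next, ILogger<MemoryLoggingMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        await _next(context);

        // GC.GetTotalMemory(false) is cheap; it does not force a collection.
        _logger.LogDebug(
            "Request {Path} finished; managed heap ~{HeapMiB} MiB, gen2 collections: {Gen2}",
            context.Request.Path,
            GC.GetTotalMemory(forceFullCollection: false) / 1_048_576,
            GC.CollectionCount(2));
    }
}
```

It would be plugged in with app.UseMiddleware&lt;MemoryLoggingMiddleware&gt;() early in the pipeline, so every route passes through it.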

raman-m commented 9 months ago

@ggnaegi and @RaynaldM Any ideas on how to monitor such cases and manage memory? It seems we definitely need to design some lightweight memory controller/monitor...

ggnaegi commented 9 months ago

> @ggnaegi and @RaynaldM Any ideas on how to monitor such cases and manage memory? It seems we definitely need to design some lightweight memory controller/monitor...

Metrics?

raman-m commented 9 months ago

Does our lovely Bla-bla gateway have any metrics/indicators or monitoring features? We could start with general metrics for memory consumption, but I'm afraid that won't be easy... It seems to require refactoring the whole Ocelot core... 🤔 I believe that, as a first step, we can read the consumed memory at the app-domain level... That should be enough... More precise metrics/indicators would require a big refactoring roadmap/milestone...
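
A minimal sketch of such an "app-domain level" reading, assuming a periodic hosted service; the class name and the 30-second interval are made up, and none of this is an existing Ocelot feature:

```csharp
// Hypothetical "lite" memory monitor: a periodic snapshot of process-level numbers,
// roughly what app-domain-level readings would give us. Not an existing Ocelot feature.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class MemoryMonitorService : BackgroundService
{
    private readonly ILogger<MemoryMonitorService> _logger;

    public MemoryMonitorService(ILogger<MemoryMonitorService> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            var gcInfo = GC.GetGCMemoryInfo();
            _logger.LogInformation(
                "Heap: {HeapMiB} MiB, committed: {CommittedMiB} MiB, working set: {WorkingSetMiB} MiB, gen2 GCs: {Gen2}",
                gcInfo.HeapSizeBytes / 1_048_576,
                gcInfo.TotalCommittedBytes / 1_048_576,
                Environment.WorkingSet / 1_048_576,
                GC.CollectionCount(2));

            await Task.Delay(TimeSpan.FromSeconds(30), stoppingToken);
        }
    }
}
```

Registering it would be a one-liner, services.AddHostedService&lt;MemoryMonitorService&gt;(), which keeps it opt-in and out of the request path.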

RaynaldM commented 9 months ago

Our experience with metrics is currently limited to measuring response times, the number of active requests (received but not yet answered), the number of requests per second, and error counts. As far as OOMs are concerned, we see them in the logs, but with a slight delay, and it's quite difficult to trace the chain of events that generated them. In many cases, an OOM is triggered in one method, but that method is merely the victim of a leak elsewhere. And to top it all off, OOMs happen quite randomly.

I think it's better to concentrate on improving Ocelot gently and case by case, whenever we recognize a problematic piece of code (and there really are lots of places where we can improve it).
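
For reference, those four indicators map naturally onto System.Diagnostics.Metrics; the meter and instrument names below are invented for illustration and say nothing about the stack actually used here:

```csharp
// Sketch of the same kind of gateway metrics using System.Diagnostics.Metrics.
// Meter and instrument names are made up for illustration only.
using System.Diagnostics.Metrics;

public static class GatewayMetrics
{
    private static readonly Meter MeterInstance = new("MyGateway", "1.0");

    // Requests per second can be derived from this counter by the metrics backend.
    public static readonly Counter<long> Requests =
        MeterInstance.CreateCounter<long>("gateway.requests", unit: "{request}");

    public static readonly Counter<long> Errors =
        MeterInstance.CreateCounter<long>("gateway.errors", unit: "{request}");

    // Requests received but not yet answered.
    public static readonly UpDownCounter<long> ActiveRequests =
        MeterInstance.CreateUpDownCounter<long>("gateway.active_requests", unit: "{request}");

    public static readonly Histogram<double> ResponseTime =
        MeterInstance.CreateHistogram<double>("gateway.response_time", unit: "ms");
}
```

The counters would be incremented and the histogram recorded around the downstream call in a middleware, and whatever backend scrapes the meter can turn the request counter into a rate.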

RaynaldM commented 9 months ago

As you can see, memory consumption varies greatly from pod to pod, due to the diversity of operations to be carried out.

[screenshot: memory consumption per pod]

It varies from 150 MB to 1.2 GB.

ks1990cn commented 9 months ago

@RaynaldM Interesting, how are you measuring these graphs? Is this something in production on Azure/AWS? Which tool is it?

What about just investigating with the Visual Studio profilers? Is that effective enough (since it won't be able to replicate high-traffic cases)?

I've just gone through the issue and I'm trying to understand what approach you are following.

https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2022

ks1990cn commented 9 months ago

[screenshot: Visual Studio profiler allocation graph under load]

Here we can see that between 0:50 and 1:40 we get two allocation snapshots, one before GC and one after GC. This was done with a load of 100 virtual users for 5 minutes. Through VS, can we revisit the code where it makes the most allocations?
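
As a rough cross-check of what the profiler shows for a single suspect code path, something like this could be used in a local run; the helper name is made up, and the measured delegate has to run synchronously on the calling thread:

```csharp
// Quick cross-check of a profiler finding for one suspect code path.
// GC.GetAllocatedBytesForCurrentThread only counts this thread, so the delegate
// must run synchronously on the calling thread; the helper name is hypothetical.
using System;

public static class AllocationCheck
{
    public static long MeasureAllocatedBytes(Action suspect)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        suspect();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

The returned byte count can then be compared against the allocation view in the VS profiler.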

ks1990cn commented 8 months ago

.NET 8 has a new feature for real-time GC monitoring: .NET Aspire.

https://learn.microsoft.com/en-us/dotnet/aspire/get-started/aspire-overview
https://www.youtube.com/watch?v=DORZA_S7f9w
https://github.com/dotnet/eShop

Can we give users the option to integrate it?
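
If such an opt-in integration were offered, it might look roughly like exposing the runtime/GC metrics over OpenTelemetry, which is what the Aspire dashboard consumes. This is only a sketch: the packages named in the comment are standard OpenTelemetry packages, and nothing here is an existing Ocelot option.

```csharp
// Sketch of an opt-in GC/runtime metrics integration (not an existing Ocelot option).
// Assumes the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.Runtime,
// OpenTelemetry.Instrumentation.AspNetCore and OpenTelemetry.Exporter.OpenTelemetryProtocol packages.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddRuntimeInstrumentation()      // GC heap sizes, collection counts, allocation rate
        .AddAspNetCoreInstrumentation()   // request duration, active requests
        .AddOtlpExporter());              // e.g. towards the .NET Aspire dashboard

var app = builder.Build();
// ... Ocelot pipeline configuration would go here as usual ...
app.Run();
```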

raman-m commented 6 months ago

> But the hunt is not finished... 😃

@RaynaldM @ggnaegi Do we need to hunt more?

ggnaegi commented 6 months ago

@raman-m @RaynaldM We should keep trying to improve the overall application performance, but the last changes already had a big impact on the OOMs. I think this issue isn't specific enough; maybe we should convert it to a discussion?