
support rate limiting for the service fabric provider #23

Open AviMualem opened 6 years ago

AviMualem commented 6 years ago

Do you want to request a feature or report a bug?

Feature

What did you do?

I tried to configure rate limits for my frontend rule by using the Traefik labels for Service Fabric.

What did you expect to see?

I expected to be able to use the following labels in my ServiceManifest.xml configuration:

traefik.frontend.rateLimit.extractorFunc
traefik.frontend.rateLimit.rateSet.burst
traefik.frontend.rateLimit.rateSet.period
traefik.frontend.rateLimit.rateSet.average

Although the Service Fabric documentation (https://docs.traefik.io/configuration/backends/servicefabric/) doesn't include these labels in the supported label list, I assumed they would work just like the traefik.backend.healthcheck label, which is supported but for some reason isn't mentioned in that list.

What did you see instead?

I couldn't define a rate limit for my frontend rule; my labels were simply ignored. I noticed that the Service Fabric template still doesn't support the labels mentioned above, while the vast majority of the other templates (kubernetes.tmpl, marathon.tmpl, etc.) do. I'm basing this on the template file, which can be found here: https://github.com/containous/traefik-extra-service-fabric/blob/master/servicefabric_tmpl.go
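
For comparison, the other provider templates render these labels into Traefik v1's frontend rate-limit TOML. Below is a purely hypothetical sketch of what an equivalent section in servicefabric_tmpl.go could look like; the helpers hasLabel, getLabelValue, and getFrontendName are illustrative assumptions, not the provider's real template functions.

```go
package servicefabric

// Hypothetical sketch only: a rate-limit block modeled loosely on the
// equivalent sections in kubernetes.tmpl and marathon.tmpl. The template
// helpers used here (hasLabel, getLabelValue, getFrontendName) are
// illustrative assumptions, not this provider's actual functions.
const rateLimitSection = `
  {{if hasLabel $service "traefik.frontend.rateLimit.extractorFunc"}}
  [frontends."{{getFrontendName $service}}".ratelimit]
    extractorfunc = "{{getLabelValue $service "traefik.frontend.rateLimit.extractorFunc"}}"
    [frontends."{{getFrontendName $service}}".ratelimit.rateset.default]
      period = "{{getLabelValue $service "traefik.frontend.rateLimit.rateSet.period"}}"
      average = {{getLabelValue $service "traefik.frontend.rateLimit.rateSet.average"}}
      burst = {{getLabelValue $service "traefik.frontend.rateLimit.rateSet.burst"}}
  {{end}}
`
```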

Output of traefik version: (What version of Traefik are you using?)

I'm using Traefik 1.5.2.

guya1976 commented 6 years ago

Important feature

guya1976 commented 6 years ago

The SF solution is great, but the lack of this feature is preventing us from fully utilizing it. An estimate would be highly appreciated. Thank you.

lawrencegripper commented 6 years ago

Hi,

Currently we have a branch awaiting review for v1.6 which includes additional label support; you are correct that the work included in v1.5 covered only a small subset of features. If you spot issues in the docs, please submit a PR to fix them up; this is a community effort, so any and all help is appreciated.

Throttling/rate limiting is a tough one and we'd really appreciate input on how best to solve it. It is not included in the PR for 1.6, as my assumption was that it requires a cross-instance shared data store for counting requests made. For example, if a limit is exceeded on one node but the load balancer routes subsequent requests to another node, the limit wouldn't be enforced unless the nodes shared data. We have investigated ways to do this using the existing property store API in SF for a similar scenario (ACME support), but haven't been able to emulate the locking/watch semantics necessary.

The first option would be to enable the labels without clustering support. Rate limits would be enforced separately per SF node. This could then be combined with the sticky session support in the Azure Load Balancer to ensure clients don't move between nodes and get a fresh allowance.
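
To make the per-node option concrete, here is a minimal sketch in Go using golang.org/x/time/rate; it illustrates the idea only and is not Traefik's actual middleware. Each node keeps its own in-memory allowance per client key, so without sticky sessions the effective cluster-wide limit grows with the node count.

```go
// Minimal per-node rate limiter sketch (option one): no shared state,
// each node enforces its own allowance per client key.
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

type perNodeLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	limit    rate.Limit // average requests per second
	burst    int
}

// allow looks up (or creates) the limiter for a client key and asks it
// whether one more request fits within the allowance.
func (p *perNodeLimiter) allow(key string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	l, ok := p.limiters[key]
	if !ok {
		l = rate.NewLimiter(p.limit, p.burst)
		p.limiters[key] = l
	}
	return l.Allow()
}

func (p *perNodeLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Rough equivalent of extractorFunc keyed on the client address.
		if !p.allow(r.RemoteAddr) {
			http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// average = 10 req/s, burst = 200, enforced on this node only.
	p := &perNodeLimiter{limiters: map[string]*rate.Limiter{}, limit: 10, burst: 200}
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })
	log.Fatal(http.ListenAndServe(":8080", p.middleware(ok)))
}
```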

The second option is to better understand the rate limiting implementation in the Traefik code and come up with a way to have a shared data store on SF between all the running instances of Traefik.

It would be great to chat in more detail about this and understand more about your usage; in the Traefik Slack there is a channel for Service Fabric.

CC @jjcollinge

AviMualem commented 6 years ago

First of all, I appreciate the quick response :). I can totally understand the difficulty of applying rate limiting in the Service Fabric environment; you are correct that the HTTP request data will have to be available across all nodes in order to properly implement the desired functionality. I have to say I don't see major value in implementing the feature per node. It could be nice, but it's definitely not the desired final functionality.

Conceptually, Service Fabric provides the option to design a stateful application/service, which is basically a service that has a shared data structure (in the form of something called a reliable dictionary; more details can be found at https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-reliable-collections) that is synced across all deployed instances of the service out of the box. I assume that using this shared, synced data structure would make it possible to implement rate limiting.

That being said, the Traefik app for Service Fabric is basically a guest executable (essentially just a reference to an .exe file), so I'm not sure how it would embrace the stateful service option. Maybe it's possible to access the replicated data structure with some REST calls; if so, that would be an amazing solution to the issue.

Anyhow, at the end of the day, all deployed instances of a given service will have to be able to access the HTTP call data, so the first thing that comes to mind is to save the data in an external store accessible to all deployed instances.

Let's say the external store hosting the data is Redis: a dedicated label could take the connection string, and the provider would record the HTTP calls there. With that, I would expect this kind of labeling config:

traefik.frontend.rateLimit.extractorFunc
traefik.frontend.rateLimit.rateSet.burst
traefik.frontend.rateLimit.rateSet.period
traefik.frontend.rateLimit.rateSet.average
traefik.frontend.rateLimit.rateSet.ExternalStoreProviderType
traefik.frontend.rateLimit.rateSet.ExternalStoreProviderConnectionString

The implemented middleware would query the external store in order to decide whether to block or pass each HTTP request.
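
As a rough illustration of that proposal (not an existing Traefik feature), a shared fixed-window counter in Redis could look like the sketch below. It uses github.com/redis/go-redis/v9 as the client, and the key scheme and window logic are assumptions.

```go
// Sketch of the external-store idea: a fixed-window rate limit counter
// in Redis, shared by all Traefik instances across SF nodes. INCR plus
// EXPIRE is good enough for a sketch; a production version would use a
// Lua script or a sliding window to avoid window-edge effects.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allowShared returns true while the request count for key within the
// current window is at or below max.
func allowShared(ctx context.Context, rdb *redis.Client, key string, max int64, window time.Duration) (bool, error) {
	// One bucket per (key, window) pair; every node increments the same key.
	bucket := fmt.Sprintf("ratelimit:%s:%d", key, time.Now().Unix()/int64(window.Seconds()))
	n, err := rdb.Incr(ctx, bucket).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First hit in this window: expire the bucket with the window.
		rdb.Expire(ctx, bucket, window)
	}
	return n <= max, nil
}

func main() {
	// The address would come from the proposed
	// ExternalStoreProviderConnectionString label.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	allowed, err := allowShared(context.Background(), rdb, "client:1.2.3.4", 100, 10*time.Second)
	fmt.Println(allowed, err)
}
```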

I will try to check the option of using a reliable data structure from a guest-executable-based application and will report back if it's possible, because I honestly believe it's the most elegant solution. If it's not possible, I guess the best option is my suggestion above.

P.S. I guess you faced this challenge in every provider integration, so how did you manage to support it in all the other providers? Thanks, Avi.

jjcollinge commented 6 years ago

Hi @AviMualem - currently the SF reliable state is only accessible via implementing a Reliable Stateful Service, and we wanted to avoid having to write and maintain an additional C# service. However, there are some updates coming which might allow us to resolve this differently; we will re-evaluate the options and try to come up with a solution, as this is useful for many features (we'll try to align with other providers which use a distributed backing store, such as etcd). P.S. @lawrencegripper and I actually work for Microsoft, not Containous, so we're not maintaining any of the other providers.

lawrencegripper commented 6 years ago

In the meantime I'll look at adding the per-node limits, as it should be a quick bit of work in the existing PR here: #16

AviMualem commented 6 years ago

"currently the SF reliable state is only accessible via implementing a Reliable Stateful Service" - This fact doesn't surprise me :), and i agree its an over kill to implement a C# service for it.

"However, there are some updates coming which might allow us to resolve this differently " - ill be happy if you can share some info :)

I guess I will have to open the source code to see how this feature is supported in other providers. I assume they all have some sort of shared store (I guess Service Fabric doesn't have anything like it, although I'm pretty sure the image store is accessible from all nodes for deployment purposes).

AviMualem commented 6 years ago

@jjcollinge

jjcollinge commented 6 years ago

@AviMualem - we'll evaluate what the most natural fit is from both the Traefik and Service Fabric side of things. We'll favour trying to leverage the existing SF replication engine if we can.

AviMualem commented 6 years ago

@jjcollinge sounds like the right choice, thanks for sharing. I'll try to be active on the Slack channel and hopefully I'll be able to help :).

johnib commented 4 years ago

Hey all, nice discussion you had back then; reality says it did not end up delivering the feature. Any update worth sharing in this regard? Should we expect anything to change in the coming months?

Much thanks!

lawrencegripper commented 4 years ago

After some attempts we weren't able to complete this work, and currently there is no active development ongoing. Sorry.