Rate limit `storage` - Githubissues

ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework

https://osparc.io

MIT License

43 stars 26 forks source link

Rate limit `storage` #5952

Open bisgaard-itis opened 1 week ago

bisgaard-itis commented 1 week ago

Is there an existing issue for this?

[X] I have searched the existing issues

Planned changes

Set a hard limit on the number of cloning operations which can be performed in storage to ensure we don't overload production deployments. This issue is completed once the following list of tasks is done:

[ ] Set hard limit (env var) on the number cloning operations which can be run at once
[ ] Return a 429 with meaningful detail in case this limit is exceeded.
[ ] Propagate changes to other "affected" services (e.g. sidecars and api-server). Note api-server needs to return a meaningful detail.

bisgaard-itis commented 3 days ago

@wvangeit @sanderegg @pcrespov here is how I will approach this. Let me know if you disagree with this approach.

After thinking a bit about this I think the best way to approach this would be to limit the clone_study endpoint in the api-server instead of doing it in storage. Here are some thoughts on this:

With this approach other services which communicate with storage will not be affected (e.g. the dynamic sidecar). This makes the change easier to implement. Of course the price to pay is that the limit is not hard: People can circumvent it by using the frontend.
I would like to do this directly in traefik. However, as far as I can tell from the traefik docs, the only way I can limit the number of in-flight requests to a service via traefik is by using https://doc.traefik.io/traefik/middlewares/http/inflightreq/ which does not allow me to specify an endpoint in the api-server. I.e. this limits the total in-flight requests to the api-server which is probably not what we want. I.e. I am left with doing it directly in the api-server. The downside of that is of course that the limit will be per replica. I.e. increasing the number of replicas of the api-server will increase the limit. But I feel we can live with that limitation.
The best package for doing rate limiting in a fastapi app (as far as I can tell) is https://github.com/laurents/slowapi which also doesn't seem to limit the number of in-flight requests to a given endpoint, but only rate limit (e.g. 10/sec).
Hence, I am left with handling this manually in the api-server. This should be very easy to do using an asyncio.Semaphore.

sanderegg commented 2 days ago

@bisgaard-itis

I also think the limiting should be done on the api-server side, for the same reason you gave here (storage is too low level).
in traefik you might be able to add specific routes rate limits. there is also a rate limit middleware that we already use I think (hope). you might have one basic one and then maybe it's possible to add on top some different ones for the routes you wish to limit more.
what I am wondering is that you wish to rate limit on the number of requests per second only, what about limiting by the number of jobs the api-server(s) launch? cause once storage issue is solved this will cascade to the director-v2 I think

pcrespov commented 2 days ago

@bisgaard-itis

I understand that the goal is to put an upper-cap limit to the number of clones/s that the system can attend.
We do not know what is the right number for the limit so I would be conservative and then increase slowly as we have more control on the capacity of the backend
So the solutions you propose to limit
- traefik: limit req/s to given given entrypoints
- middleware: same as above but done in api-server
- asyncio.Semaphore: same as previous but more granular (e.g. can limit calls to webserver.clone only)
Nonetheless, none of the above are distributed, i.e. the limits apply per replica and not per service

bisgaard-itis commented 2 days ago

@bisgaard-itis

* I also think the limiting should be done on the api-server side, for the same reason you gave here (storage is too low level).

* in traefik you might be able to add specific routes rate limits. there is also a rate limit middleware that we already use I think (hope). you might have one basic one and then maybe it's possible to add on top some different ones for the routes you wish to limit more.

* what I am wondering is that you wish to rate limit on the number of requests per second only, what about limiting by the number of jobs the api-server(s) launch? cause once storage issue is solved this will cascade to the director-v2 I think

What I want to limit is the number of requests to a given endpoint in the api-server being processed at any given time. E.g. I would like to set the limit that a maximum of 5 clone study operations can be performed at any given time. If a request hits that endpoint and 5 requests to that endpoint are already in progress I would return a 429. Concerning the director-v2 thing, the idea would be to have a general way to do this so we can easily limit other endpoints if needed. I will also go ahead and limit the start study endpoint in the api-server straight away.

bisgaard-itis commented 2 days ago

middleware: same as above but done in api-server

My understanding is that the only really distributed solution is to use traefik (https://doc.traefik.io/traefik/middlewares/http/inflightreq/). There we already limit the number of requests per second which can hit the api-server. But we could also limit the number of in-flight requests to the api-server there. That would be distributed. But it doesn't really hit the target of this issue: To have control over the number of requests to a given endpoint in progress.

wvangeit commented 2 days ago

I understand the intention, and am not against it per se. But to be honest I kind of doubt it will really solve the problem we have, and possibly make it worse. As mentioned in our meeting last week, what probably brought the system to its knees last week, was me working on studies (not using any metamodelling/cloning at all), that had interactive studies with massive amounts of data, that were basically taking storage slots. The solution here would have been the system limiting these tranfers, not the cloning by another user. So limiting cloning would only artificially solve the problem, with probably too low limits for the amount of clones to happen. Also, not all clones are equal. Cloning a study that requires 5 MB to transfer will have a completely different consequence than a study cloning 20GB.

bisgaard-itis commented 2 days ago

I understand the intention, and am not against it per se. But to be honest I kind of doubt it will really solve the problem we have, and possibly make it worse. As mentioned in our meeting last week, what probably brought the system to its knees last week, was me working on studies (not using any metamodelling/cloning at all), that had interactive studies with massive amounts of data, that were basically taking storage slots. The solution here would have been the system limiting these tranfers, not the cloning by another user. So limiting cloning would only artificially solve the problem, with probably too low limits for the amount of clones to happen. Also, not all clones are equal. Cloning a study that requires 5 MB to transfer will have a completely different consequence than a study cloning 20GB.

OK, I see your point. Of course it would be good to target a solution which prevents us seeing the same issue in the future. So if not limiting in the api-server, then the question is what we should then do to avoid this happening in the future? From the analysis we did with @sanderegg last week "taking storage slots" basically meant holding db connections while uploading/copying data via storage. Sylvain is working a solution for not holding the db connections while doing the upload/copying, so that should probably fix the issue for now, meaning we might see the issue/bottleneck somewhere else next time.

My understanding (correct me if I am wrong @sanderegg) is that once the "holding of db connections" issue is resolved uploading a 20GB file or a 5MB file should put the same load on storage because the actual uploading of data is done directly to S3. What really matters from the point of view of storage is the number of upload/copy operations which are performed concurrently. Do you agree @sanderegg?

If this is correct I would say a correct approach to reducing backpressure in storage would be to limit the total number of upload/copy operations which can be in progress at any given time. If we want to make sure that these limits also hit users in the dynamic sidecar I guess we will have to perform this limiting directly in storage. I see your point @wvangeit that we could be setting these limits too low which would impact user experience, but in a sense I understand that the whole point is that we want to have the control over those limits to be conservative. Otherwise I simply wouldn't know how to do it. But my ears are open for other solutions. Let me know.

sanderegg commented 2 days ago

Once I get the DB locking in storage in control:

5GB or 20GB should make no difference in terms of Database locking
it will still make a difference through on storage load (although many order smaller) as S3 does not really allow in place copying or maybe up to 5GB which I still need to find out.

But as we discussed:

moving from REST api to message system for these copies would allow storage to only take as many jobs as it can, therefore the api-server will have to wait, which means it might be overwhelmed with rest calls??
we could also design some higher priority queues, priorizing the webserver over the api-server
there is no silver bullet and we will have to experiment.
Nevertheless @wvangeit I prefer a slightly less nice user experience to having the whole deployment broken. And everyone out there sets rate limits to prevent their servers to burn ;). The idea of the limit is that we can guarantee that the service runs, until we fix things and deploy higher limits vs we run to fix things and have no time to make long terms fixes.

bisgaard-itis commented 2 days ago

Once I get the DB locking in storage in control:

* 5GB or 20GB should make no difference in terms of Database locking

* it will still make a difference through on storage load (although many order smaller) as S3 does not really allow in place copying or maybe up to 5GB which I still need to find out.

But as we discussed:

* moving from REST api to message system for these copies would allow storage to only take as many jobs as it can, therefore the api-server will have to wait, which means it might be overwhelmed with rest calls??

* we could also design some higher priority queues, priorizing the webserver over the api-server

* there is no silver bullet and we will have to experiment.

* Nevertheless @wvangeit I prefer a slightly less nice user experience to having the whole deployment broken. And everyone out there sets rate limits to prevent their servers to burn ;). The idea of the limit is that we can guarantee that the service runs, until we fix things and deploy higher limits vs we run to fix things and have no time to make long terms fixes.

Concerning the api-server getting overwhelmed with REST requests, I would say a simple solution is to use the Traefik inflight middleware( https://doc.traefik.io/traefik/middlewares/http/inflightreq/ ). That guarantees that the number of requests the api-server is processing at any given time will not exceed a limit. That of course means that the user will receive a lot of 429s, but we have retry strategies for that in the client.

wvangeit commented 2 days ago

Yes. I just mean to make things a bit more dynamic instead of us having to artificially tune limits all over the place. Imo hard limiting is best at the lowest level, probably here storage. And then when higher level services are hit with limit-reached responses from lower level services they start backing off. (Ideally higher level services can query some status before they reach the limit, but that could be a later stage).

I'm just saying this out of experience wrt to my parallelworker artificial limit. The limit of 10 workers worked for my studies, and it didnt see any real stress on the system. Then it failed for one user, because some completely different studies were running in the background. And for another user it started failing at 3 parallel tasks. Even with things now being file-size independent, I'm afraid we'll end up in an endless cycle of tuning of we dont' automate this.

bisgaard-itis commented 2 days ago

Yes. I just mean to make things a bit more dynamic instead of us having to artificially tune limits all over the place. Imo hard limiting is best at the lowest level, probably here storage. And then when higher level services are hit with limit-reached responses from lower level services they start backing off. (Ideally higher level services can query some status before they reach the limit, but that could be a later stage).

I'm just saying this out of experience wrt to my parallelworker artificial limit. The limit of 10 workers worked for my studies, and it didnt see any real stress on the system. Then it failed for one user, because some completely different studies were running in the background. And for another user it started failing at 3 parallel tasks. Even with things now being file-size independent, I'm afraid we'll end up in an endless cycle of tuning of we dont' automate this.

Very good point! OK, then I will try to add these absolute limits for individual endpoints directly in storage.