mln-wave opened this issue 2 years ago
Logs in the deployed endpoint are:
@mln-wave what's the GPU type you are using?
I tested this with T4s and autoscaling works, but it's kind of slow. Two concurrent requests seem to kill the first instance before the second one comes up. I then tried running requests one after another and could see the number of replicas increasing in the console, but I still wasn't able to make concurrent requests. Based on the Vertex docs, I suspect this is a scaling issue: autoscaling keys off CPU usage, and the CPU stays underutilized even as replicas come up.
Vertex AI scales your nodes based on CPU usage even if you have configured your prediction nodes to use GPUs; therefore if your prediction throughput is causing high GPU usage, but not high CPU usage, your nodes might not scale as you expect, as autoscaling with [MetricSpecs](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/DedicatedResources#autoscalingmetricspec) will autoscale to the most utilized resource.
On a V100, I can do multiple concurrent requests, but I eventually hit a limit and the same problem as above happens.
I'll be doing more experimentation this week and will get back to you on the results.
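For reference, the two-request test is roughly this (a sketch using the google-cloud-aiplatform SDK; the project, region, endpoint ID, and the instance format are placeholders, not details from this thread):

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud import aiplatform

# Placeholder project, region, and endpoint ID.
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")

def predict(prompt):
    # The instance schema depends on how the serving container parses requests;
    # {"prompt": ...} is only an assumption here.
    return endpoint.predict(instances=[{"prompt": prompt}])

# Fire two requests at the same time to see whether they land on separate
# replicas or whether the second one fails while a new replica spins up.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(predict, p) for p in ("a red bicycle", "a blue bicycle")]
    results = [f.result() for f in futures]
```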
Hi @entrpn ,
Yes, I also read about MetricSpecs and the CPU utilization behaviour. Based on the resource usage graph for the deployed endpoint, GPU and CPU usage only ramp up gradually and don't reach 60% right away, so both requests get allotted to a single replica. Hence I then tried setting autoscaling_target_cpu_utilization=5 and autoscaling_target_accelerator_duty_cycle=5 (the deploy params that correspond to MetricSpecs in the Python SDK), so that the second request would get allocated to a second replica. But I again got 503 and 502 errors right at the start (sooner than before).
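Roughly, the deploy call I mean looks like this (a sketch with the google-cloud-aiplatform SDK; the project, region, model ID, and machine type are placeholders):

```python
from google.cloud import aiplatform

# Placeholder project, region, and model ID.
aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("1234567890")

endpoint = model.deploy(
    machine_type="n1-standard-8",       # placeholder machine type
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=5,
    # Very low targets, so a second concurrent request should spill onto a new replica.
    autoscaling_target_cpu_utilization=5,
    autoscaling_target_accelerator_duty_cycle=5,
)
```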
I don't know exactly what's happening! But I assume people have used Vertex AI autoscaling with GPUs for other tasks; I couldn't figure out how they managed it, and I couldn't find any blogs or tutorials on successful autoscaling with GPUs.
Practically, concurrent requests are quite likely: once the endpoint is deployed, multiple people can hit it at the same time.
And since there is no autoscaling to 0 in Vertex AI, one instance would be running all the time even when no users are using it. In that case an idle V100 would incur high costs and waste resources and money, so I'm trying with an NVIDIA T4.
@TejaswiniiB @mln-wave instead of waiting for Vertex to autoscale, which is kind of slow, you can start with multiple replicas. I also decreased the autoscaling target utilization for CPU and GPU. It seems that lowering the GPU utilization target is what forces requests to go to other replicas. I tested it and it seems to work well.
```
python gcp_deploy.py --image-uri gcr.io/<project_id>/stable-diffusion:latest --max-replica-count 5 --min-replica-count 5
```
Change the replica counts to what fits best for you.
Be aware that
Hi @entrpn
Thank you for your quick responses and help.
But if we set min-replica-count=5, won't all 5 replicas be running all the time, even when there is no traffic?
@TejaswiniiB it will. You can set the count to a lower number. Just eyeballing, it can take up to 10 minutes for new replicas to become ready.
Okay, got it. Thank you!
I have set max_replica_count to 5, and when I tried making 2 concurrent requests to the deployed endpoint, I got the error "google.api_core.exceptions.ServiceUnavailable: 503 502:Bad Gateway".
Why isn't it autoscaling? What should I do to make autoscaling work?
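In the meantime, a client-side stopgap while new replicas become ready is to retry the transient 503s; a minimal sketch, assuming the google-cloud-aiplatform SDK, with placeholder project, endpoint ID, and backoff values:

```python
import time

from google.api_core.exceptions import ServiceUnavailable
from google.cloud import aiplatform

# Placeholder project, region, and endpoint ID.
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")

def predict_with_retry(instances, attempts=5, backoff_seconds=30):
    # Retry the transient 503 "502:Bad Gateway" responses that show up
    # while additional replicas are still spinning up.
    for attempt in range(attempts):
        try:
            return endpoint.predict(instances=instances)
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_seconds)
```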