SAP / sap-btp-service-operator

SAP BTP service operator enables developers to connect Kubernetes clusters to SAP BTP accounts and to consume SAP BTP services within the clusters by using Kubernetes native tools.
Apache License 2.0

Low throughput when creating 100 HDLFS instances #428

Closed thomas-h-w closed 6 months ago

thomas-h-w commented 6 months ago

Hello, we tried to deploy 100 k8s manifests of

kind: ServiceInstance
...
spec:
  serviceOfferingName: hana-cloud
  servicePlanName: relational-data-lake

but we very quickly ran into rate limiting on the Service Manager side: The allowed request limit of 6000 requests has been reached. With the first run we made it only to 3 instances. After some hours, we reached 12 (out of 100).

Question: What can we do to increase the throughput? How to reduce the number of requests? For us it's unclear why the creation of 100 instances should create 6000 requests. Or should we reduce the batch size?

Please note that, ultimately, we want to be able to provision a few thousand instances in an HDLFS multitenancy onboarding scenario.

Best regards, Thomas

I065450 commented 6 months ago

Hi @thomas-h-w

In the positive scenario, where instance creation succeeds, we perform two requests to SM per instance. Could you please provide more details about your process? Do your instance creations succeed or fail? Do you also create a binding for each instance? Which release of the operator are you using?

Regards, Naama

thomas-h-w commented 6 months ago

Which release of the operator are you using?

Hi, regarding the version, this is what we use in our repo:

image:
    repository: ghcr.io/sap/sap-btp-service-operator/controller
    tag: v0.6.1
thomas-h-w commented 6 months ago

Could you please provide more details about your process?

The process is as follows:

  1. We have the BTP Service Operator in AWS EKS cluster (no Kyma, no CF)
  2. We create an ArgoCD app monitoring our repo. The repo then gets 100 manifest files submitted (PR). ArgoCD picks this up and basically calls `kubectl apply -f` 100 times.
thomas-h-w commented 6 months ago

Do your instance creations succeed or fail? Do you also create a binding for each instance?

12 succeeded, the rest failed. No, no service bindings. We then repeated the experiment with 50 instances, after deleting all instances and disabling the ArgoCD process. This time, all creations fail.

thomas-h-w commented 6 months ago

We see BTP Service Operator logs like this (for all 50 instances):

2024-05-06T12:20:07Z    INFO    controllers.ServiceInstance     instance is not in final state, async operation is in progress (/v1/service_instances/c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4/operations/4399bc00-e91f-4d30-adcc-60cb3dcf83f0)   {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z    INFO    controllers.ServiceInstance     resource is in progress, found operation url /v1/service_instances/c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4/operations/4399bc00-e91f-4d30-adcc-60cb3dcf83f0       {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z    INFO    secret-resolver Searching for secret in resource namespace      {"namespace": "dbinfra", "name": "sap-btp-service-operator"}
2024-05-06T12:20:07Z    INFO    secret-resolver Searching for namespace secret in management namespace  {"namespace": "dbinfra", "managementNamespace": "sap-btp-operator", "name": "sap-btp-service-operator"}
2024-05-06T12:20:07Z    INFO    secret-resolver Searching for cluster secret    {"releaseNamespace": "sap-btp-operator", "name": "sap-btp-service-operator"}
2024-05-06T12:20:07Z    INFO    controllers.ServiceInstance     last operation description is 'clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofcat: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofres: spec updated'       {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z    INFO    controllers.ServiceInstance     setting inProgress conditions: reason: CreateInProgress, message:clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofcat: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofres: spec updated, generation: 1       {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z    INFO    controllers.ServiceInstance     updating ServiceInstance status {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
and many more like this.

Can we do more to troubleshoot?

kerenlahav commented 6 months ago

Hi @thomas-h-w

You are creating 50 hana-cloud instances in parallel. The creation of this service is asynchronous and takes time. The operator polls the broker (through service-manager) every 10 seconds for each instance until the instance is ready, and this is why you reach the limit. If this is your main use case, you can increase the default polling interval in the values.yaml to a value that fits this service. What is the error message of one of the failed instances? How many service instances are eventually going to be created in parallel? Thanks, Keren.
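
The polling arithmetic can be sketched roughly as follows. This is only an estimate under stated assumptions: it counts one Service Manager request per poll per instance and ignores all other requests, and the length of the rate-limit window is not stated in this thread, so the code only reports how long it takes for polling alone to add up to 6000 requests. The function names are illustrative and not part of the operator.

```python
def polls_per_minute(instances: int, poll_interval_s: float) -> float:
    """Polling requests per minute if every in-progress instance is polled once per interval."""
    return instances * 60.0 / poll_interval_s


def minutes_to_reach(limit: int, instances: int, poll_interval_s: float) -> float:
    """Minutes of continuous polling until `limit` requests have been sent (polling only)."""
    return limit / polls_per_minute(instances, poll_interval_s)


# Compare the default 10 s interval with a 60 s interval for the batch sizes in this thread.
for instances in (50, 100):
    for interval_s in (10, 60):
        rate = polls_per_minute(instances, interval_s)
        minutes = minutes_to_reach(6000, instances, interval_s)
        print(f"{instances} instances, {interval_s}s interval: "
              f"{rate:.0f} polls/min, 6000 requests after {minutes:.0f} min")
```

Under these assumptions, 100 instances polled every 10 seconds produce 600 requests per minute, so 6000 requests are used up after about 10 minutes of polling, well within the ~5-10 minutes a single instance normally needs plus any backlog.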

thomas-h-w commented 6 months ago

Hi @kerenlahav,

  1. In the README I didn't find documentation on how to set the default polling interval in the values.yaml. Can you please give me a hint?
  2. Where can I find the error message of one of the failed instances?
  3. For the rollout of the product, we expect to create a few thousand service instances. Not necessarily all in parallel, but in a reasonable time frame (hours, or a weekend, but not weeks).
kerenlahav commented 6 months ago

We don't recommend changing it; for most services, 10 seconds is enough to be ready. To change the polling interval to 1 minute, add the following to the helm command: --set manager.poll_interval=60000000000 (time in nanoseconds)

To see the error message, attach the YAML of one of the failed instances: kubectl get serviceinstance instance-name -n namespace-name -o yaml

Do you know how long it takes to create one instance? BTW, the limit is temporary; it will be resolved eventually without any user involvement.
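
Since `manager.poll_interval` is given in nanoseconds, it is easy to miscount zeros when writing the value by hand. A minimal sketch of a helper that builds the value from seconds; the function name is hypothetical and not part of the operator or its Helm chart:

```python
NANOS_PER_SECOND = 1_000_000_000


def poll_interval_nanos(seconds: int) -> int:
    """Convert a polling interval in whole seconds to nanoseconds."""
    if seconds <= 0:
        raise ValueError("poll interval must be a positive number of seconds")
    return seconds * NANOS_PER_SECOND


# Build the helm flag for a 1-minute polling interval.
print(f"--set manager.poll_interval={poll_interval_nanos(60)}")
```

For 60 seconds this yields 60000000000, the value quoted above.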

thomas-h-w commented 6 months ago

Information I get from the btp CLI on one of the instances:

~ btp get services/instance 7d7ffad4-e920-4f60-9352-5d52a21857c5 --subaccount 9aa9e615-23ab-40d3-8504-249df1ef118d
id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
ready: false
last_operation:
  id: 62b423da-81a3-4952-8338-90be557ba60c
  ready: true
  description: clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofcat: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofres: spec updated
  type: create
  state: in progress
  resource_id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
  resource_type: /v1/service_instances
  platform_id: service-manager
  correlation_id: 44477e22-d225-469f-591a-7a4b00d51488
  reschedule: true
  reschedule_timestamp: 2024-05-06T09:01:41.586244Z
  deletion_scheduled: 0001-01-01T00:00:00Z
  created_at: 2024-05-06T09:01:40.865592Z
  updated_at: 2024-05-06T15:23:42.517177Z
name: data-lake-noiq-1
service_plan_id: 227ed822-446f-4367-9f31-29675673b6bb
platform_id: 4a514b5a-5379-4b6b-a08d-1feabbcbc72b
dashboard_url: https://suite-analytics-gl2abltr.hana-tooling.ingress.orchestration.canary-eu10.hanacloud.ondemand.com/start?host=7d7ffad4-e920-4f60-9352-5d52a21857c5.files.hdl.canary-eu10.hanacloud.ondemand.com
context:
  clusterid: 1AF5F601-1C89-C628-5729-1B3E750C4F55
  namespace: dbinfra
  license_type: SAPDEV
  subdomain: suite-analytics-gl2abltr
  crm_customer_id:
  platform: sapcp
  zone_id: 91ec7db5-989a-4c9f-93fa-ecf14310a029
  global_account_id: 5367be90-a8d1-4398-8754-2ccf069b176b
  subaccount_id: 9aa9e615-23ab-40d3-8504-249df1ef118d
  region: cf-eu10-canary
  env_type: kubernetes
  origin: kubernetes
  instance_name: data-lake-noiq-1
usable: false
subaccount_id: 9aa9e615-23ab-40d3-8504-249df1ef118d
protected: <null>
created_at: 2024-05-06T09:01:40.865589Z
updated_at: 2024-05-06T09:01:40.865589Z
labels: _k8sname = data-lake-noiq-1; operated_by = 4a514b5a-5379-4b6b-a08d-1feabbcbc72b; subaccount_id = 9aa9e615-23ab-40d3-8504-249df1ef118d
kerenlahav commented 6 months ago

This instance is not in a failed state; it is still being created. See the last operation info.

thomas-h-w commented 6 months ago

Ah, sorry for the confusion. The ArgoCD app reports it as "degraded", which is something different; from that perspective it's "failed", I guess. Anyway, how to proceed: we have been waiting for the creation of hana-cloud/relational-data-lake instances for ~8 hours now, but none has been created. Where can we troubleshoot this further? Maybe we should look at the status of the Service Manager? Or the respective backend broker? How can we do that?

kerenlahav commented 6 months ago

Can you please restart the BTP operator and see whether the instance status has changed after ~3-4 minutes?

thomas-h-w commented 6 months ago

The get serviceinstance output is as follows (beginning omitted):

status:
  conditions:
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: The allowed request limit of 6000 requests has been reached please try
      again later
    observedGeneration: 1
    reason: CreateInProgress
    status: "False"
    type: Succeeded
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: ""
    reason: NotProvisioned
    status: "False"
    type: Ready
  observedGeneration: 1
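
When checking many instances, the conditions above can be summarized programmatically. The sketch below works on the `status` block as a plain dictionary (for example, as loaded from `kubectl get serviceinstance ... -o json`); the function names are illustrative, and matching on the message text "allowed request limit" is an assumption based only on the message shown in this thread.

```python
def summarize_conditions(status: dict) -> dict:
    """Map each condition type to its status, reason, and message."""
    return {
        cond["type"]: {
            "status": cond.get("status"),
            "reason": cond.get("reason"),
            "message": cond.get("message", ""),
        }
        for cond in status.get("conditions", [])
    }


def is_rate_limited(status: dict) -> bool:
    """True if any condition message mentions the request limit quoted in this thread."""
    return any(
        "allowed request limit" in (cond.get("message") or "")
        for cond in status.get("conditions", [])
    )


# Sample data copied from the status output in this thread.
status = {
    "conditions": [
        {
            "type": "Succeeded",
            "status": "False",
            "reason": "CreateInProgress",
            "message": "The allowed request limit of 6000 requests has been reached please try again later",
        },
        {"type": "Ready", "status": "False", "reason": "NotProvisioned", "message": ""},
    ],
    "observedGeneration": 1,
}

summary = summarize_conditions(status)
print(summary["Succeeded"]["reason"], is_rate_limited(status))
```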
thomas-h-w commented 6 months ago

service_instance.txt

thomas-h-w commented 6 months ago

Do you know how long it takes to create one instance?

Usually ~5-10 min

kerenlahav commented 6 months ago

Did the status change after the restart? I'm trying to understand whether there is a bug that causes the operator to stop polling after a while.

thomas-h-w commented 6 months ago

Restarted the operator:

k get pods -n=sap-btp-operator
NAME                                                   READY   STATUS    RESTARTS   AGE
sap-btp-operator-controller-manager-7999b858dc-fqjd9   2/2     Running   0          76s
sap-btp-operator-controller-manager-7999b858dc-nhwkw   2/2     Running   0          93s

But no status change:

status:
  conditions:
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: The allowed request limit of 6000 requests has been reached please try
      again later
    observedGeneration: 1
    reason: CreateInProgress
    status: "False"
    type: Succeeded
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: ""
    reason: NotProvisioned
    status: "False"
    type: Ready

and

 ~ btp get services/instance 7d7ffad4-e920-4f60-9352-5d52a21857c5 --subaccount 9aa9e615-23ab-40d3-8504-249df1ef118d
id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
ready: false
last_operation:
  id: 62b423da-81a3-4952-8338-90be557ba60c
  ready: true
  description: clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofcat: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofres: spec updated
  type: create
  state: in progress
kerenlahav commented 6 months ago

OK, thank you for the information. We will investigate and update. Which BTP landscape is it?

thomas-h-w commented 6 months ago

I guess the piece of information you are looking for is cf-eu10-canary. The other facts are:

Subdomain: suite-analytics-gl2abltr
Tenant ID: 91ec7db5-989a-4c9f-93fa-ecf14310a029
Subaccount ID: 9aa9e615-23ab-40d3-8504-249df1ef118d
Provider: Amazon Web Services (AWS)
Region: Europe (Frankfurt) - Canary
URL: https://cpcli.cf.sap.hana.ondemand.com

thomas-h-w commented 6 months ago

Hi @kerenlahav, any update from your side? Can we stop the experiment for now and delete the instances? Or are you still analyzing?

kerenlahav commented 6 months ago

Hi @thomas-h-w, according to the service-manager logs it took the HANA broker 2 days to create the instance. Please open an NGPBUG for service-manager and we'll forward it with the relevant information to the HANA broker.

thomas-h-w commented 6 months ago

Hi @kerenlahav, OK, I created https://jira.tools.sap/browse/NGPBUG-387686. Can we also see these SM logs, and how do we access them? It would be very helpful for us to be able to troubleshoot ourselves before we reach out to you; otherwise it's just an extra hop.

I065450 commented 6 months ago

Hi @thomas-h-w

I've updated the ticket with the logs. To view these logs, you'll need to retrieve the correlation ID from the operation using the BTP CLI and then search for it in Kibana.

Regards, Naama
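
As a sketch of the first step, the correlation ID can be extracted from the `btp get services/instance` output shown earlier in this thread. The parsing assumes the plain `key: value` layout of that output; the function name is illustrative and not part of the BTP CLI, and the sample is an excerpt of the output posted above.

```python
import re
from typing import Optional


def extract_correlation_id(cli_output: str) -> Optional[str]:
    """Return the first correlation_id value found in the CLI output, or None."""
    match = re.search(r"^\s*correlation_id:\s*(\S+)\s*$", cli_output, flags=re.MULTILINE)
    return match.group(1) if match else None


# Excerpt of the CLI output posted earlier in this thread.
sample = """\
id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
ready: false
last_operation:
  id: 62b423da-81a3-4952-8338-90be557ba60c
  type: create
  state: in progress
  correlation_id: 44477e22-d225-469f-591a-7a4b00d51488
"""

print(extract_correlation_id(sample))
```

The resulting ID is the value to search for in Kibana.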

kerenlahav commented 6 months ago

Handled in bug https://jira.tools.sap/browse/NGPBUG-387686