Closed: thomas-h-w closed this issue 6 months ago
Hi @thomas-h-w
In a positive scenario where instance creation succeeds, we perform two requests to SM per instance. Could you please provide more details about your process? Does your instance creation succeed or fail? Do you also create a binding for each instance? Which release of the operator are you using?
Regards, Naama
Which release of the operator are you using?
Hi, regarding the version, we use the following in our repo:
image:
  repository: ghcr.io/sap/sap-btp-service-operator/controller
  tag: v0.6.1
Could you please provide more details about your process?
The process is as follows:
Does your instance creation succeed or fail? Do you also create a binding for each instance?
12 succeeded, the rest failed. No, no service bindings. We have now repeated the experiment (after deleting all instances and disabling the ArgoCD process) with 50 instances. Now all creations fail.
We see BTP Service Operator logs like this (for all 50 instances):
2024-05-06T12:20:07Z INFO controllers.ServiceInstance instance is not in final state, async operation is in progress (/v1/service_instances/c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4/operations/4399bc00-e91f-4d30-adcc-60cb3dcf83f0) {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z INFO controllers.ServiceInstance resource is in progress, found operation url /v1/service_instances/c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4/operations/4399bc00-e91f-4d30-adcc-60cb3dcf83f0 {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z INFO secret-resolver Searching for secret in resource namespace {"namespace": "dbinfra", "name": "sap-btp-service-operator"}
2024-05-06T12:20:07Z INFO secret-resolver Searching for namespace secret in management namespace {"namespace": "dbinfra", "managementNamespace": "sap-btp-operator", "name": "sap-btp-service-operator"}
2024-05-06T12:20:07Z INFO secret-resolver Searching for cluster secret {"releaseNamespace": "sap-btp-operator", "name": "sap-btp-service-operator"}
2024-05-06T12:20:07Z INFO controllers.ServiceInstance last operation description is 'clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofcat: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofres: spec updated' {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z INFO controllers.ServiceInstance setting inProgress conditions: reason: CreateInProgress, message:clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofcat: spec updated, clusterfilecontainer-c10bfc1c-5684-418c-a5aa-8bfb5d3ab9c4-sofres: spec updated, generation: 1 {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
2024-05-06T12:20:07Z INFO controllers.ServiceInstance updating ServiceInstance status {"serviceinstance": {"name":"data-lake-noiq-18","namespace":"dbinfra"}, "correlation_id": "88cb7afd-db91-46ec-8c03-92e6654219ec"}
and many more like this.
Can we do more to troubleshoot?
Hi @thomas-h-w
You are creating 50 instances of hana-cloud in parallel. The creation of this service is asynchronous and takes time; the operator polls the broker (through Service Manager) every 10 seconds (for each instance) until the instance is ready, and this is why you reach the limit. You can increase the default polling interval in the values.yaml to a value that fits this service if this is your main use case. What is the error message of one of the failed instances? How many service instances are going to be created in parallel eventually? Thanks, Keren.
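A rough back-of-the-envelope check of that polling volume (a sketch: the instance count, 10-second interval, and 6000-request limit are taken from this thread; the time window over which the limit applies is an assumption):

```shell
# Sketch: estimate how fast 50 parallel provisionings exhaust the
# Service Manager rate limit when each pending instance is polled
# every 10 seconds. Numbers come from this thread; the limit's
# accounting window is assumed.
INSTANCES=50
POLL_INTERVAL_S=10
LIMIT=6000

# Polling requests generated per hour while all instances are pending.
POLLS_PER_HOUR=$((INSTANCES * 3600 / POLL_INTERVAL_S))
echo "polls per hour: $POLLS_PER_HOUR"       # 18000

# Minutes until the 6000-request limit is reached at that rate.
MINUTES_TO_LIMIT=$((LIMIT * POLL_INTERVAL_S / INSTANCES / 60))
echo "minutes to limit: $MINUTES_TO_LIMIT"   # 20
```

So for a service that takes many minutes (or longer) to provision, sustained polling of 50 pending instances alone can plausibly exhaust the limit.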
Hi @kerenlahav,
We don't recommend changing it; for most services 10 seconds is enough to become ready. Add the following line to the helm command (this will change the polling interval to 1 minute): --set manager.poll_interval=60000000000 (time in nanoseconds)
To see the error message, attach the YAML of one of the failed instances: kubectl get serviceinstance instance-name -n namespace-name -o yaml
Do you know how long it takes to create one instance? BTW, the limit is temporary; it will be resolved eventually without any user involvement.
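For reference, the nanosecond conversion in the tip above can be sketched like this (only the manager.poll_interval key and its value come from this thread; the helm release and chart names in the comment are assumptions):

```shell
# Convert the desired polling interval to nanoseconds, since
# manager.poll_interval expects a value in nanoseconds.
POLL_INTERVAL_SECONDS=60
POLL_INTERVAL_NS=$((POLL_INTERVAL_SECONDS * 1000000000))
echo "$POLL_INTERVAL_NS"   # 60000000000

# Then pass it when upgrading the operator's helm release,
# e.g. (release/chart names assumed):
# helm upgrade sap-btp-operator sap-btp-operator/sap-btp-operator \
#   -n sap-btp-operator --reuse-values \
#   --set manager.poll_interval=${POLL_INTERVAL_NS}
```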
Information I get from the btp CLI on one of the instances:
~ btp get services/instance 7d7ffad4-e920-4f60-9352-5d52a21857c5 --subaccount 9aa9e615-23ab-40d3-8504-249df1ef118d
id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
ready: false
last_operation:
  id: 62b423da-81a3-4952-8338-90be557ba60c
  ready: true
  description: clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofcat: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofres: spec updated
  type: create
  state: in progress
  resource_id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
  resource_type: /v1/service_instances
  platform_id: service-manager
  correlation_id: 44477e22-d225-469f-591a-7a4b00d51488
  reschedule: true
  reschedule_timestamp: 2024-05-06T09:01:41.586244Z
  deletion_scheduled: 0001-01-01T00:00:00Z
  created_at: 2024-05-06T09:01:40.865592Z
  updated_at: 2024-05-06T15:23:42.517177Z
name: data-lake-noiq-1
service_plan_id: 227ed822-446f-4367-9f31-29675673b6bb
platform_id: 4a514b5a-5379-4b6b-a08d-1feabbcbc72b
dashboard_url: https://suite-analytics-gl2abltr.hana-tooling.ingress.orchestration.canary-eu10.hanacloud.ondemand.com/start?host=7d7ffad4-e920-4f60-9352-5d52a21857c5.files.hdl.canary-eu10.hanacloud.ondemand.com
context:
  clusterid: 1AF5F601-1C89-C628-5729-1B3E750C4F55
  namespace: dbinfra
  license_type: SAPDEV
  subdomain: suite-analytics-gl2abltr
  crm_customer_id:
  platform: sapcp
  zone_id: 91ec7db5-989a-4c9f-93fa-ecf14310a029
  global_account_id: 5367be90-a8d1-4398-8754-2ccf069b176b
  subaccount_id: 9aa9e615-23ab-40d3-8504-249df1ef118d
  region: cf-eu10-canary
  env_type: kubernetes
  origin: kubernetes
  instance_name: data-lake-noiq-1
usable: false
subaccount_id: 9aa9e615-23ab-40d3-8504-249df1ef118d
protected: <null>
created_at: 2024-05-06T09:01:40.865589Z
updated_at: 2024-05-06T09:01:40.865589Z
labels: _k8sname = data-lake-noiq-1; operated_by = 4a514b5a-5379-4b6b-a08d-1feabbcbc72b; subaccount_id = 9aa9e615-23ab-40d3-8504-249df1ef118d
This instance is not in a failed state; it is still being created. See the last operation info.
Ah, sorry for the confusion. From the ArgoCD app it's reported as "degraded", but that's something different; from that perspective it's "failed", I guess. Anyway, how should we proceed? We have been waiting for the creation of the hana-cloud/relational-data-lake instances for ~8 hours now, but none was created. Where can we troubleshoot further? Maybe we should look at the status of the Service Manager, or the respective backend broker? How do we do that?
Can you please restart the BTP operator and see if the instance status changes after ~3-4 minutes?
The get serviceinstance output is as follows (beginning omitted):
status:
  conditions:
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: The allowed request limit of 6000 requests has been reached please try
      again later
    observedGeneration: 1
    reason: CreateInProgress
    status: "False"
    type: Succeeded
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: ""
    observedGeneration: 1
    reason: NotProvisioned
    status: "False"
    type: Ready
Do you know how long it takes to create one instance?
Usually ~5-10 min
Did the status change after the restart? I'm trying to understand whether there is a bug where the operator stops polling after a while.
Restarted the operator:
k get pods -n=sap-btp-operator
NAME READY STATUS RESTARTS AGE
sap-btp-operator-controller-manager-7999b858dc-fqjd9 2/2 Running 0 76s
sap-btp-operator-controller-manager-7999b858dc-nhwkw 2/2 Running 0 93s
But no status change:
status:
  conditions:
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: The allowed request limit of 6000 requests has been reached please try
      again later
    observedGeneration: 1
    reason: CreateInProgress
    status: "False"
    type: Succeeded
  - lastTransitionTime: "2024-05-06T09:01:40Z"
    message: ""
    reason: NotProvisioned
    status: "False"
    type: Ready
and
~ btp get services/instance 7d7ffad4-e920-4f60-9352-5d52a21857c5 --subaccount 9aa9e615-23ab-40d3-8504-249df1ef118d
id: 7d7ffad4-e920-4f60-9352-5d52a21857c5
ready: false
last_operation:
  id: 62b423da-81a3-4952-8338-90be557ba60c
  ready: true
  description: clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofcat: spec updated, clusterfilecontainer-7d7ffad4-e920-4f60-9352-5d52a21857c5-sofres: spec updated
  type: create
  state: in progress
OK, thank you for the information; we will investigate and update. Which BTP landscape is it?
I guess the piece of information you are looking for is cf-eu10-canary. The other facts are: Subdomain: suite-analytics-gl2abltr; Tenant ID: 91ec7db5-989a-4c9f-93fa-ecf14310a029; Subaccount ID: 9aa9e615-23ab-40d3-8504-249df1ef118d; Provider: Amazon Web Services (AWS); Region: Europe (Frankfurt) - Canary; URL: https://cpcli.cf.sap.hana.ondemand.com
Hi @kerenlahav, any update from your side? Can we stop the experiment for now and delete the instances? Or are you still analyzing?
Hi @thomas-h-w, according to the service-manager logs it took the HANA broker 2 days to create the instance. Please open an NGPBUG ticket for service-manager and we'll forward it with the relevant information to the HANA broker.
Hi @kerenlahav, OK, I created https://jira.tools.sap/browse/NGPBUG-387686. Can we also see these SM logs? How do we access them? It would be very helpful for us to be able to troubleshoot ourselves before reaching out to you; it's just an extra hop.
Hi @thomas-h-w
I've updated the ticket with the logs. To view these logs, you'll need to retrieve the correlation ID from the operation using the BTP CLI and then search for it in Kibana.
Regards, Naama
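The lookup Naama describes can be sketched as follows. Parsing the CLI's plain-text output with awk is an assumption about its format; the sketch runs against a saved copy of the output pasted earlier in this thread rather than calling the CLI live:

```shell
# Sketch: extract the last operation's correlation ID so it can be
# searched in Kibana. The live command would be:
#   btp get services/instance <instance-id> --subaccount <subaccount-id>
# Here we parse a saved excerpt of the output shown earlier in this thread.
OUTPUT='last_operation:
  id: 62b423da-81a3-4952-8338-90be557ba60c
  correlation_id: 44477e22-d225-469f-591a-7a4b00d51488'

# Take the value after the first "correlation_id:" key.
CORRELATION_ID=$(printf '%s\n' "$OUTPUT" | awk '/correlation_id:/ {print $2; exit}')
echo "$CORRELATION_ID"   # 44477e22-d225-469f-591a-7a4b00d51488
```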
handled in bug https://jira.tools.sap/browse/NGPBUG-387686
Hello, we tried to deploy 100 k8s manifests of
but we very quickly ran into rate limiting on the Service Manager side:
The allowed request limit of 6000 requests has been reached
With the first run we made it only to 3 instances. After some hours, we reached 12 (out of 100). Question: What can we do to increase the throughput? How can we reduce the number of requests? For us it's unclear why the creation of 100 instances should create 6000 requests. Or should we reduce the batch size?
Please note that, ultimately, we want to be able to provision a few thousand instances in an HDLFS multitenancy onboarding scenario.
Best regards, Thomas