Azure / open-service-broker-azure

The Open Service Broker API Server for Azure Services
https://osba.sh
MIT License

postgresql successfully deployed on azure but ends up with OrphanMitigation, finally gets deleted and instance state is failed #681

Closed (cforce closed this 5 years ago)

cforce commented 5 years ago

svcat provision test --class azure-postgresql-10 --plan basic -ndev --params-json '{"cores":1,"storage":10,"backupRetention":7,"location": "northeurope","resourceGroup": "deveuaks","firewallRules": [{"startIPAddress": "0.0.0.0","endIPAddress": "255.255.255.255","name":"AllowAll"}]}' --logtostderr

Is the error below the reason why I can't provision PostgreSQL and end up with OrphanMitigation before the instance goes to failed? Also see https://social.msdn.microsoft.com/Forums/azure/en-US/30a90ddd-0949-42c4-9504-fc7a8756fbe6/postgresql-virtual-net-rule-issue-when-having-basic-tier?forum=AzureDatabaseforPostgreSQL

time="2019-02-28T15:57:23Z" level=info msg="Open Service Broker for Azure starting" commit=4797e90 version=v1.5.0
time="2019-02-28T15:57:23Z" level=info msg="Setting log level" logLevel=INFO
time="2019-02-28T15:57:23Z" level=info msg="Sensitive instance and binding details will be encrypted" encryptionScheme=AES256
time="2019-02-28T15:57:23Z" level=info msg="API server is listening with TLS enabled" address="https://0.0.0.0:8443"
time="2019-02-28T16:04:22Z" level=error msg="error executing job; not submitting any follow-up tasks" error="error executing provisioning step \"setupDatabase\" for instance \"a960f6a1-3b71-11e9-9603-56fc98800543\": error executing provisioning step: error starting transaction: pq: Client connections to Basic tier servers through Virtual Network Service Endpoints are not supported. Virtual Network Service Endpoints are supported for General Purpose and Memory Optimized severs." job=executeProvisioningStep taskID=a0735aa3-6b08-44ac-b3fa-38e283889d7c
time="2019-02-28T16:33:06Z" level=error msg="error executing job; not submitting any follow-up tasks" error="error executing provisioning step \"setupDatabase\" for instance \"ad92f9a6-3b75-11e9-9603-56fc98800543\": error executing provisioning step: error starting transaction: pq: Client connections to Basic tier servers through Virtual Network Service Endpoints are not supported. Virtual Network Service Endpoints are supported for General Purpose and Memory Optimized severs." job=executeProvisioningStep taskID=ff79f672-e63d-42e4-8e8a-a20b3ba1b2c8
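To make it easier to see which provisioning step failed and what the driver actually reported, the key=value log lines above can be split apart. This is only a minimal sketch: the helper name `parse_logfmt` is made up for illustration, the sample line is abridged from the log above, and it assumes the quoting in these lines is shell-compatible so that Python's `shlex` can tokenize it.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict (hypothetical helper)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


# Abridged copy of the first error line from the log above.
line = (
    'time="2019-02-28T16:04:22Z" level=error '
    'msg="error executing job; not submitting any follow-up tasks" '
    'error="error executing provisioning step \\"setupDatabase\\" for instance '
    '\\"a960f6a1-3b71-11e9-9603-56fc98800543\\": error executing provisioning step: '
    'error starting transaction: pq: Client connections to Basic tier servers '
    'through Virtual Network Service Endpoints are not supported." '
    'job=executeProvisioningStep'
)

fields = parse_logfmt(line)
# Everything after "pq: " is the message returned through the PostgreSQL driver.
driver_message = fields["error"].split("pq: ", 1)[1]
print(fields["level"], "|", fields["job"], "|", driver_message)
```

The point of isolating the `pq:` part is that it shows the failure happens when the broker connects to the server in the `setupDatabase` step, not while the ARM resources are being created.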

I can even see in the resource group that the deployment runs.

Then the server shows as ready in the Azure Portal PostgreSQL view...

...then OSB reports OrphanMitigation and the Azure Portal shows the server being deleted. It seems to make no difference whether I use PostgreSQL 9.6 or 10, or whether the plan is general-purpose or basic.

I also found this in the logs:

ervicecatalog.ClusterServicePlan ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received
W0228 20:37:43.454447       1 reflector.go:270] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: watch of *servicecatalog.ClusterServiceClass ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received
W0228 20:50:22.854747       1 reflector.go:270] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: watch of *servicecatalog.ServiceInstance ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received
W0228 20:54:39.755997       1 reflector.go:270] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: watch of *servicecatalog.ClusterServiceClass ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received
W0228 20:58:02.756730       1 reflector.go:270] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: watch of *servicecatalog.ClusterServicePlan ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received
W0228 21:07:52.155131       1 reflector.go:270] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: watch of *servicecatalog.ClusterServicePlan ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received
W0228 21:10:00.859114       1 reflector.go:270] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: watch of *servicecatalog.ClusterServiceClass ended with: very short watch: github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:118: Unexpected watch close - watch lasted less than a second and no items received

Does the SP have to be Contributor on the whole subscription? Is that because of the server, the database, or the firewall resources being created? Which roles exactly do I need in order to limit the creation of resources to the defined resource group?

cforce commented 5 years ago

The issue also exists on OSB 1.4.0. What I have found out so far is that the issue seems to depend on the location I deploy the DB to. It works for eastus, but not for northeurope, although PostgreSQL is GA there too: https://azure.microsoft.com/en-us/global-infrastructure/services/?products=postgresql&regions=non-regional,us-east,us-east-2,us-central,us-north-central,us-south-central,us-west-central,us-west,us-west-2,canada-east,canada-central,europe-north,europe-west

Could it be that some API OSB is using is not supported there, or something with firewall or vnet settings that is normally not part of a plain PostgreSQL deployment on Azure (one not using OSB)? In any case, my AKS cluster, vnet, and resource group are all located in northeurope.

zhongyi-zhang commented 5 years ago

Does the SP have to be Contributor on the whole subscription?

No, resource group scoped contributor is enough.
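As a rough illustration of what a resource-group-scoped assignment looks like, the sketch below builds the scope string and an `az role assignment create` invocation. The subscription ID and service principal ID are placeholders, the helper names are made up, and `deveuaks` is simply the resource group from the command earlier in this thread; nothing here is taken from the OSBA docs.

```python
import shlex

# Placeholders, not real values: substitute your own IDs.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
SP_APP_ID = "11111111-1111-1111-1111-111111111111"
RESOURCE_GROUP = "deveuaks"  # resource group used earlier in this thread


def resource_group_scope(subscription_id: str, resource_group: str) -> str:
    """Return the ARM scope that limits a role to one resource group."""
    return f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"


def role_assignment_command(app_id: str, scope: str) -> list:
    """Build the Azure CLI call that grants Contributor at the given scope."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", app_id,
        "--role", "Contributor",
        "--scope", scope,
    ]


scope = resource_group_scope(SUBSCRIPTION_ID, RESOURCE_GROUP)
command = role_assignment_command(SP_APP_ID, scope)
print(shlex.join(command))
```

A scope ending in `/resourceGroups/<name>` means the SP can only create resources inside that group, as opposed to a scope of `/subscriptions/<id>`, which would cover the whole subscription.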

The issue also exists on OSB 1.4.0.

You mean the service endpoints error? OSBA v1.4.0 doesn't even support that. Actually, I am confused by the logs you provided: why was the error raised by the PostgreSQL client? The provisioning step in OSBA that uses the PostgreSQL client definitely runs after the database instance and all other ARM resources have been created successfully. And the client is a common community client without any knowledge of service endpoints. So it is possibly an Azure service bug.

What I have found out so far is that the issue seems to depend on the location I deploy the DB to.

That makes a service bug more likely. Both regions being GA does not mean that releases are rolled out to them in sync.

cforce commented 5 years ago

You mean the service endpoints error? OSBA v1.4.0 doesn't even support that.

I tried first with 1.5.0 and then went down to 1.4.0, as I thought it might be a bug in that version, until I stumbled over the fact that it seems to (also) be location specific (still testing to prove it).

I am confused by the logs you provided: why was the error raised by the PostgreSQL client?

Those logs come from the run on 1.5.0. Still confused then? The error definitely is in the logs, and judging from the message it points to PostgreSQL. What is the "Virtual Network Service Endpoints" feature about? It looks like I can't use the whole plan, or is it a combination of params and the plan?

The provisioning step in OSBA that uses the PostgreSQL client definitely runs after the database instance and all other ARM resources have been created successfully.

It looks like the client can't successfully validate that the instance is up and running, so the instance goes to orphaned and might then be taken down by OSB again as part of the automatic handling? I can't see why the PostgreSQL instance is deleted from Azure again, or who deletes it; it was shown as up and running for some minutes before it disappeared. There are no logs in OSB about what it is doing here.

And the client is a common community client without any knowledge of service endpoints.

What is the GitHub repo for this client? Are there any bugs reported there regarding this?

zhongyi-zhang commented 5 years ago

No need to divert attention to the client. Neither you nor OSBA set the parameter to create a vnet rule. (You can confirm this by looking at the ARM deployment template in the resource group before service-catalog calls OSBA to delete it.) Additionally, it doesn't make sense that the same scenario failed in northeurope but succeeded in eastus. If you can still reproduce this, please file a support ticket from the Azure Portal.

cforce commented 5 years ago

You are right. I was just able to do it successfully in northeurope on another subscription with OSB 1.4.0. But what else could it be? How can I hunt the issue down?

zhongyi-zhang commented 5 years ago

That’s fine if not reproducible. Maybe it was transient or the fix was just rolled out. You can continue following up on it once you hit it again. Can I close the issue for now?

cforce commented 5 years ago

It is reproducible on the original subscription, where I need to get it running. I have no idea why I don't see this issue on the other one.

cforce commented 5 years ago

svcat provision myapp --class azure-postgresql-10 --plan general-purpose -ndev --params-json '{"tags":{"microservice":"myapp","env":"dev"},"location": "northeurope","resourceGroup": "mygroup"}'

time="2019-03-01T22:54:30Z" level=error msg="error executing job; not submitting any follow-up tasks" error="error executing provisioning step \"setupDatabase\" for instance \"8b63011d-3c74-11e9-b172-46c8584e03e8\": error executing provisioning step: error starting transaction: pq: Client from Azure Virtual Networks is not allowed to access the server. Please make sure your Virtual Network is correctly configured." job=executeProvisioningStep taskID=8b6e6382-b5be-4f2c-9a7e-abe46c893877

Maybe it has something to do with the fact that VNET Service Endpoints are enabled on the subnet of my AKS cluster, and therefore OSB can't reach the server after it is created.

"Virtual Network service endpoint: A Virtual Network service endpoint is a subnet whose property values include one or more formal Azure service type names. In this article we are interested in the type name of Microsoft.Sql, which refers to the Azure service named SQL Database. When using service endpoints for Azure SQL Database, Outbound to Azure SQL Database Public IPs is required: Network Security Groups (NSGs) must be opened to Azure SQL Database IPs to allow connectivity. You can do this by using NSG Service Tags for Azure SQL Database." https://docs.microsoft.com/en-us/azure/virtual-network/security-overview#service-tags

https://social.msdn.microsoft.com/Forums/azure/en-US/911329c2-0dd4-4b64-b327-51a4522ac77e/fatal-server-is-not-configured-to-allow-ipv6-connections?forum=AzureDatabaseforPostgreSQL

I also saw that you recently made changes to the vnet setup: https://github.com/Azure/open-service-broker-azure/commit/4797e909d6fa9cb5fe609c591aa95382365c55cd

cforce commented 5 years ago

The problem was indeed that OSB (and any other connection from the kube nodes) was blocked because the resource group's vnet has virtual network service endpoints enabled (specifically the one labeled "azure sql servers"), which blocks the traffic. I added the k8s subnet to the PostgreSQL server as allowed, using the "virtual networks" param that was recently introduced for PostgreSQL with OSB 1.5.0.

The reason it also worked without this param in another location is that the PostgreSQL instance in that case was deployed into another vnet, where virtual network service endpoints for Azure SQL servers were not enabled (the default).
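For anyone hitting the same thing, here is a rough sketch of how the provisioning parameters for the fix described above might be assembled. It assumes the parameter is called `virtualNetworkRules` and takes entries with `name` and `subnetId`; that is my recollection of the OSBA PostgreSQL module docs and should be verified against the docs for your OSBA version. The subscription ID, vnet name, subnet name, and rule name are all placeholders.

```python
import json
import shlex

# Placeholders: substitute the values of your own AKS subnet.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "mygroup"
VNET_NAME = "aks-vnet"      # hypothetical vnet name
SUBNET_NAME = "aks-subnet"  # hypothetical subnet name

# Full ARM resource ID of the subnet the kube nodes live in.
subnet_id = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Network/virtualNetworks/{VNET_NAME}"
    f"/subnets/{SUBNET_NAME}"
)

params = {
    "location": "northeurope",
    "resourceGroup": RESOURCE_GROUP,
    # Assumed parameter name and shape; check the OSBA PostgreSQL docs.
    "virtualNetworkRules": [
        {"name": "AllowAksSubnet", "subnetId": subnet_id},
    ],
}

# Same svcat invocation as earlier in this thread, with the extra parameter.
command = [
    "svcat", "provision", "myapp",
    "--class", "azure-postgresql-10",
    "--plan", "general-purpose",
    "-n", "dev",
    "--params-json", json.dumps(params),
]
print(shlex.join(command))
```

Building the JSON with `json.dumps` and quoting the command with `shlex.join` avoids the hand-escaping mistakes that are easy to make when the subnet ID is pasted into a `--params-json` string directly.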