NetApp / terraform-provider-netapp-cloudmanager

Terraform provider to create NetApp OCCM instances, CVO resources, volumes, snapshots, etc. in Azure, AWS, and GCP.
Mozilla Public License 2.0

504 error during deployment or destroying resources #116

Open bryanheo opened 2 years ago

bryanheo commented 2 years ago

Hello

We are deploying NetApp CVO in AWS through Terraform, and we sometimes get a 504 error during deployment, as shown below, even though the resources are actually created successfully in AWS. Because of the error, the TF state file is not updated and we have to redeploy (destroying the existing AWS resources via CloudFormation and redeploying through Terraform Enterprise). If we redeploy, it works fine. The error also sometimes occurs when we destroy TF resources. Is this a known issue, or is it something you can investigate?

504 error during the deployment (screenshot)

504 error during destroying TF resources

Error: code: 504, message: 
│ 
│   with module.usw2.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {

Regards Moon
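For context, a minimal sketch of the kind of resource configuration involved here, based on the provider's documented arguments. All values below are placeholders, and real deployments set many more arguments:

```hcl
# Minimal hypothetical netapp-cloudmanager_cvo_aws configuration.
# Values are placeholders; a real deployment needs far more arguments
# (licensing, capacity, security groups, etc.).
resource "netapp-cloudmanager_cvo_aws" "this" {
  name      = "examplecvo"       # working environment name
  region    = "us-west-2"
  subnet_id = "subnet-00000000"
  vpc_id    = "vpc-00000000"
  client_id = var.client_id      # connector (OCCM) client ID
  is_ha     = true               # HA pairs take notably longer to deploy
}
```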

suhasbshekar commented 2 years ago

We have not seen this kind of issue before. Could you send your playbook (.tf) file so that we can try to reproduce it on our end?

bryanheo commented 2 years ago

@suhasbshekar The error does not always happen, but it sometimes occurs along with other error messages like the ones below. In addition, when we deploy a CVO HA cluster, it always takes 35 minutes. Is that normal?

Could you let me know a safe way to upload the files so that you can investigate?

Error 1

╷
│ Error: Post "https://netapp-cloud-account.auth0.com/oauth/token": dial tcp: lookup netapp-cloud-account.auth0.com on 127.0.0.1:53: read udp 127.0.0.1:57538->127.0.0.1:53: read: connection refused
│ 
│ 
╵

Error 2

│ Error: Post "https://cloudmanager.cloud.netapp.com/occm/api/aws/ha/working-environments": dial tcp: lookup cloudmanager.cloud.netapp.com on 127.0.0.1:53: read udp 127.0.0.1:54913->127.0.0.1:53: read: connection refused
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 

Error 3

╷
│ Error: code: 500, message: {"message":"Server Fault","causeMessage":"ConnectException: Connection refused (Connection refused)"}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 4

╷
│ Error: code: 400, message: Failure received for messageId JDxc6CJu with context . Failure message: occm: Name or service not known
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 5

╷
│ Error: code: 400, message: Failure received for messageId Va9yIR5c with context . Failure message: {"message":"Connection refused: occm/10.5.20.4:80","cause":null,"stackTrace":[{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":96,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":82,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"recover","fileName":"Try.scala","lineNumber":233,"className":"scala.util.Failure","nativeMethod":false},{"methodName":"run","fileName":"Promise.scala","lineNumber":450,"className":"scala.concurrent.impl.Promise$Transformation","nativeMethod":false},{"methodName":"processBatch","fileName":"BatchingExecutor.scala","lineNumber":55,"className":"akka.dispatch.BatchingExecutor$AbstractBatch","nativeMethod":false},{"methodName":"$anonfun$run$1","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"apply","fileName":"JFunction0$mcV$sp.scala","lineNumber":18,"className":"scala.runtime.java8.JFunction0$mcV$sp","nativeMethod":false},{"methodName":"withBlockContext","fileName":"BlockContext.scala","lineNumber":94,"className":"scala.concurrent.BlockContext$","nativeMethod":false},{"methodName":"run","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"run","fileName":"AbstractDispatcher.scala","lineNumber":47,"className":"akka.dispatch.TaskInvocation","nativeMethod":false},{"methodName":"exec","fileName":"ForkJoinExecutorConfigurator.scala","lineNumber":47,"className":"akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask","nativeMethod":false},{"methodName":"doExec","fileName":"ForkJoinTask.java","lineNumber":289,"className":"java.util.concurrent.ForkJoinTask","nativeMethod":false},{"methodName":"runTask","fileName":"ForkJoinPool.java","lineNumber":1056,"className":"java.util.concurrent.ForkJoinPool$WorkQueue","nativeMethod":false},{"methodName":"runWorker","fileName":"ForkJoinPool.java","lineNumber":1692,"className":"java.util.concurrent.ForkJoinPool","nativeMethod":false},{"methodName":"run","fileName":"ForkJoinWorkerThread.java","lineNumber":175,"className":"java.util.concurrent.ForkJoinWorkerThread","nativeMethod":false}],"localizedMessage":"Connection refused: occm/10.5.20.4:80","suppressed":[]}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
╵

suhasbshekar commented 2 years ago

Yes, it can sometimes take 35 minutes or more, but we test with a demo version or simple inputs; the time depends on the complexity of the inputs used.

edarzi commented 2 years ago

It can reach 35 minutes for HA. Is the 504 issue reproducible? In that specific case, it seems your connector was restarted due to health failures.

bryanheo commented 2 years ago

@edarzi The 504 error happens while the mediator is being created. I am trying to debug the issue, but the Cloud Manager timeline does not show the error, and the CVO clusters are successfully created after the error. In order to update the TF state file, I have to destroy the CVOs via CloudFormation and redeploy through TF again. Is there any way to investigate this? How can I check whether the connector was restarted during the deployment?


bryanheo commented 2 years ago

Could you also let us know how to import netapp-cloudmanager_cvo_aws into the TF state file?

bryanheo commented 2 years ago

@edarzi @suhasbshekar As requested, I have created a NetApp support case (2009274344) and uploaded the playbook file to the case. We are using a connector policy as guided by NetApp (https://docs.netapp.com/us-en/cloud-manager-setup-admin/reference-permissions-aws.html). Could you have a look?

edarzi commented 2 years ago

Could you let us know how to import netapp-cloudmanager_cvo_aws in TF state file as well?

https://registry.terraform.io/providers/NetApp/netapp-cloudmanager/latest/docs/data-sources/cvo_aws
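A sketch of reading an existing CVO via the data source linked above; the working environment name and client ID values here are placeholders:

```hcl
# Hypothetical use of the cvo_aws data source to look up an existing
# working environment instead of importing it into state.
data "netapp-cloudmanager_cvo_aws" "existing" {
  name      = "examplecvo"    # existing working environment name
  client_id = var.client_id   # connector (OCCM) client ID
}

output "cvo_id" {
  value = data.netapp-cloudmanager_cvo_aws.existing.id
}
```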

bryanheo commented 2 years ago

@edarzi @lonico We still have the same issue, and we are trying to import the resources rather than deleting the CVO through CloudFormation. Could we import the CVO resources with 'terraform import' rather than using a data source?

module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Creating...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Still creating... [10s elapsed]
╷
│ Error: code: 400, message: {"message":"The name netappamtnuse1pri is already used by another working environment. Please use another one.","causeMessage":"BadRequestException: The name netappamtnuse1pri is already used by another working environment. Please use another one."}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵
moonyoung.heo@C02C35ZVMD6T ap-netapp-np % terraform import module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this VsaWorkingEnvironment-xxxxx
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Importing from ID "VsaWorkingEnvironment-xxxxx"...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Import prepared!
  Prepared netapp-cloudmanager_cvo_aws for import
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Refreshing state... [id=VsaWorkingEnvironment-xxxxx]
╷
│ Error: code: 400, message: Missing X-Agent-Id header
│ 
│ 
╵
lonico commented 2 years ago

No, we don't support importing a connector. The APIs do not allow us to fetch enough information.

It would be better if Cloud Manager provided an API to create a connector, rather than us having to go through both the cloud provider APIs and the Cloud Manager APIs; that introduces a level of complexity.

bryanheo commented 2 years ago

@lonico @edarzi @suhasbshekar The issue keeps happening, both from Terraform Enterprise and from a local laptop. I cannot see any error on the Cloud Manager timeline. The CVOs are successfully deployed in AWS when the error occurs, but I have to redeploy them because of the inconsistent TF state file. Do you have any way to find out why the 504 error happens?


lonico commented 2 years ago

@bryanheo Since it looks like a Cloud Manager issue, I would suggest you open a case to track this issue.

@suhasbshekar @edarzi Should we retry on such an error? How many times? Can we be more specific about the context?

bryanheo commented 2 years ago

@lonico Thank you for your suggestion. I am not sure whether this issue is related to Cloud Manager, because I did not get a 504 error when I deployed CVO manually through Cloud Manager. Anyway, as you suggested, I will create a case on the NetApp support site.

edarzi commented 2 years ago

Will need some more details in order to track and debug. Ping me at erand@netapp.com

bryanheo commented 2 years ago

@edarzi Thank you for your reply. As mentioned earlier, I have uploaded our entire TF code on NetApp support case (2009274344) and could you have a look? If you cannot access the case, please let me know

edarzi commented 2 years ago

I will need logs from the connector



bryanheo commented 2 years ago

@edarzi could you let me know how to get the logs from the connector? Could we use AutoSupport?

edarzi commented 2 years ago

You can download the AutoSupport file from the Cloud Manager UI and send it to my email, please. You can also send me the service manager log from: /opt/application/netapp/cloudmanager/log/service-manager.log

lonico commented 2 years ago

@edarzi Any update on this? We're attempting to add a retry, but without understanding the root cause we don't know whether a retry would help, or how many times / how long we should retry.

bryanheo commented 2 years ago

@edarzi I have sent an email with the AutoSupport file from the Cloud Manager UI, but the file is about 30 MB and was rejected by your mail server. Could you let me know where to upload the 30 MB file? (The NetApp support ticket does not accept the AutoSupport 7z file either.) In addition, I do not know how to get /opt/application/netapp/cloudmanager/log/service-manager.log. Could you let me know how to retrieve the log file?


lonico commented 2 years ago

We released 22.9.0 yesterday (9/8). It provides some retries on 504 errors. Can you see if it helps?

bryanheo commented 2 years ago

@lonico I have deployed NetApp CVO clusters several times with 22.9.0 and have not seen the 504 error so far. It looks better than the previous version. I will let you know if we see the error again.

lonico commented 2 years ago

That's great news. As you know, we added a retry on 504. You can see it in the logs by setting TF_LOG to DEBUG or TRACE. I'm curious to see whether it always works on the first retry (which would indicate some sort of transient issue) or whether we need to retry several times.

laagabi commented 2 years ago

Hi @lonico

I'm Gabor with NetApp Tech Support, and I have been working with the customer on this issue.

@bryanheo As discussed, for me to investigate from the Cloud Manager end, we need logging verbosity enabled in Cloud Manager. This might allow us to see how long it takes Cloud Manager to process the requests, and we can proactively enhance the software to work better with Terraform.

Once that is done, simply trigger a Cloud Manager AutoSupport and I will review it.

bryanheo commented 2 years ago

Hi @lonico I thought the issue had been resolved, but it has happened again. As mentioned above, the NetApp AWS resources were successfully created, but because of the 504 error the Terraform state was not updated. In other words, we have to redeploy the cluster. Could you investigate it?
