bryanheo opened this issue 2 years ago
We have not seen this kind of issue before. Could you send your playbook (.tf) file so that we can try to reproduce it on our end?
@suhasbshekar The error does not always happen, but it sometimes occurs with other error messages like the ones below. In addition, when we deploy a CVO HA cluster, it always takes 35 minutes. Is that normal?
Could you let me know a safe way to upload the files so that you can investigate?
Error 1
╷
│ Error: Post "https://netapp-cloud-account.auth0.com/oauth/token": dial tcp: lookup netapp-cloud-account.auth0.com on 127.0.0.1:53: read udp 127.0.0.1:57538->127.0.0.1:53: read: connection refused
│
│
╵
Error 2
╷
│ Error: Post "https://cloudmanager.cloud.netapp.com/occm/api/aws/ha/working-environments": dial tcp: lookup cloudmanager.cloud.netapp.com on 127.0.0.1:53: read udp 127.0.0.1:54913->127.0.0.1:53: read: connection refused
│
│ with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│ on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│ 1: resource "netapp-cloudmanager_cvo_aws" "this" {
│
╵
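Errors 1 and 2 are local DNS failures rather than provider bugs: Terraform's HTTP client is querying the resolver at 127.0.0.1:53 and the connection is refused, which typically means a local caching stub (e.g. systemd-resolved or dnsmasq) is not running. As a diagnostic sketch (the error text below is copied from this issue), you can pull the refusing resolver address straight out of the message:

```shell
# Extract the resolver address from the Terraform error message.
# Diagnostic sketch only; the error string is copied verbatim from this issue.
err='Error: Post "https://netapp-cloud-account.auth0.com/oauth/token": dial tcp: lookup netapp-cloud-account.auth0.com on 127.0.0.1:53: read udp 127.0.0.1:57538->127.0.0.1:53: read: connection refused'
echo "$err" | grep -o 'on [0-9.]*:53' | head -n1
# If this prints "on 127.0.0.1:53", check /etc/resolv.conf and verify that
# something is actually listening on the loopback stub (e.g. with `ss -ulpn`).
```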
Error 3
╷
│ Error: code: 500, message: {"message":"Server Fault","causeMessage":"ConnectException: Connection refused (Connection refused)"}
│
│ with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│ on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│ 1: resource "netapp-cloudmanager_cvo_aws" "this" {
│
╵
Error 4
╷
│ Error: code: 400, message: Failure received for messageId JDxc6CJu with context . Failure message: occm: Name or service not known
│
│ with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│ on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│ 1: resource "netapp-cloudmanager_cvo_aws" "this" {
│
╵
Error 5
╷
│ Error: code: 400, message: Failure received for messageId Va9yIR5c with context . Failure message: {"message":"Connection refused: occm/10.5.20.4:80","cause":null,"stackTrace":[{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":96,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":82,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"recover","fileName":"Try.scala","lineNumber":233,"className":"scala.util.Failure","nativeMethod":false},{"methodName":"run","fileName":"Promise.scala","lineNumber":450,"className":"scala.concurrent.impl.Promise$Transformation","nativeMethod":false},{"methodName":"processBatch","fileName":"BatchingExecutor.scala","lineNumber":55,"className":"akka.dispatch.BatchingExecutor$AbstractBatch","nativeMethod":false},{"methodName":"$anonfun$run$1","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"apply","fileName":"JFunction0$mcV$sp.scala","lineNumber":18,"className":"scala.runtime.java8.JFunction0$mcV$sp","nativeMethod":false},{"methodName":"withBlockContext","fileName":"BlockContext.scala","lineNumber":94,"className":"scala.concurrent.BlockContext$","nativeMethod":false},{"methodName":"run","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"run","fileName":"AbstractDispatcher.scala","lineNumber":47,"className":"akka.dispatch.TaskInvocation","nativeMethod":false},{"methodName":"exec","fileName":"ForkJoinExecutorConfigurator.scala","lineNumber":47,"className":"akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask","nativeMethod":false},{"methodName":"doExec","f
ileName":"ForkJoinTask.java","lineNumber":289,"className":"java.util.concurrent.ForkJoinTask","nativeMethod":false},{"methodName":"runTask","fileName":"ForkJoinPool.java","lineNumber":1056,"className":"java.util.concurrent.ForkJoinPool$WorkQueue","nativeMethod":false},{"methodName":"runWorker","fileName":"ForkJoinPool.java","lineNumber":1692,"className":"java.util.concurrent.ForkJoinPool","nativeMethod":false},{"methodName":"run","fileName":"ForkJoinWorkerThread.java","lineNumber":175,"className":"java.util.concurrent.ForkJoinWorkerThread","nativeMethod":false}],"localizedMessage":"Connection refused: occm/10.5.20.4:80","suppressed":[]}
│
│ with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│ on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│ 1: resource "netapp-cloudmanager_cvo_aws" "this" {
│
╵
Yes, it can sometimes take 35 minutes or more. We test with a demo version or simple inputs; the duration depends on the complexity of the inputs used.
It can reach 35 minutes for HA. Is the 504 issue reproducible? In that specific case it seems that your connector was restarted due to health failures.
@edarzi The 504 error happens while the mediator is being created. I am trying to debug the issue, but the Cloud Manager timeline does not show the error, and the CVO clusters are successfully created after the error occurs. In order to update the TF state file, I have to destroy the CVOs via CloudFormation and redeploy through TF again. Is there any way to investigate this? How can I check whether the connector was restarted during the deployment?
Could you also let us know how to import netapp-cloudmanager_cvo_aws into the TF state file?
@edarzi @suhasbshekar As requested, I have created NetApp support case 2009274344 and uploaded the playbook file to the case. We are using a connector policy as documented by NetApp (https://docs.netapp.com/us-en/cloud-manager-setup-admin/reference-permissions-aws.html). Could you have a look?
> Could you let us know how to import netapp-cloudmanager_cvo_aws into the TF state file as well?
https://registry.terraform.io/providers/NetApp/netapp-cloudmanager/latest/docs/data-sources/cvo_aws
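Per the data source linked above, an existing working environment can be read back by name without importing it into state. A minimal sketch, assuming the attribute names from the linked docs (the working environment name is the one from this issue; the connector client ID variable is a placeholder):

```hcl
# Look up an existing CVO working environment by name instead of importing it.
# client_id is the Cloud Manager connector's client ID; the variable name here
# is a placeholder for illustration.
data "netapp-cloudmanager_cvo_aws" "existing" {
  name      = "netappamtnuse1pri"
  client_id = var.cloudmanager_connector_client_id
}
```

Note this only reads the working environment's attributes; it does not bring the resource under Terraform management the way `terraform import` would.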
@edarzi @lonico We still have the same issue, and we are trying to import the resources rather than deleting the CVO through CloudFormation. Can we import the CVO resources with `terraform import` rather than using the data source?
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Creating...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Still creating... [10s elapsed]
╷
│ Error: code: 400, message: {"message":"The name netappamtnuse1pri is already used by another working environment. Please use another one.","causeMessage":"BadRequestException: The name netappamtnuse1pri is already used by another working environment. Please use another one."}
│
│ with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│ on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│ 1: resource "netapp-cloudmanager_cvo_aws" "this" {
│
╵
moonyoung.heo@C02C35ZVMD6T ap-netapp-np % terraform import module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this VsaWorkingEnvironment-xxxxx
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Importing from ID "VsaWorkingEnvironment-xxxxx"...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Import prepared!
Prepared netapp-cloudmanager_cvo_aws for import
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Refreshing state... [id=VsaWorkingEnvironment-xxxxx]
╷
│ Error: code: 400, message: Missing X-Agent-Id header
│
│
╵
No, we don't support importing a connector. The APIs do not allow us to fetch enough information.
It would be better if Cloud Manager provided an API to create a connector, rather than us having to go through both the cloud provider APIs and the Cloud Manager APIs; that introduces a level of complexity.
@lonico @edarzi @suhasbshekar The issue keeps happening, both from Terraform Enterprise and from a local laptop. I cannot see any error on the Cloud Manager timeline. The CVOs are successfully deployed in AWS when the error occurs, but I have to redeploy them because of the inconsistent TF state file. Do you have any way to find out why the 504 error happens?
@bryanheo Since it looks like a Cloud Manager issue, I would suggest you open a case to track this issue.
@suhasbshekar @edarzi Should we retry on such an error? How many times? Can we be more specific about the context?
@lonico Thank you for your suggestion. I am not sure whether this issue is related to Cloud Manager, because I did not get a 504 error when I deployed CVO through Cloud Manager manually. Anyway, as you suggested, I will create a case on the NetApp support site.
Will need some more details in order to track and debug. Ping me at erand@netapp.com
@edarzi Thank you for your reply. As mentioned earlier, I have uploaded our entire TF code to NetApp support case 2009274344. Could you have a look? If you cannot access the case, please let me know.
I will need logs from the connector
@edarzi Could you let me know how to get the logs from the connector? Could we use AutoSupport?
You can download the AutoSupport file from the Cloud Manager UI and send it to my mail. You can also send me the service manager log from /opt/application/netapp/cloudmanager/log/service-manager.log.
@edarzi Any update on this? We're attempting to add a retry, but without understanding the root cause we don't know whether a retry would help, or how many times / how long we should retry.
@edarzi I have sent an email with the AutoSupport file from the Cloud Manager UI, but the file is about 30 MB and was rejected by your mail server. Could you let me know where I can upload a 30 MB file? (The NetApp support ticket does not accept the AutoSupport 7z file either.) In addition, I do not know how to get /opt/application/netapp/cloudmanager/log/service-manager.log. Could you let me know how to retrieve that log file?
We released 22.9.0 yesterday (9/8). It provides some retries on 504 errors. Can you see if it helps?
@lonico I have deployed NetApp CVO clusters several times with 22.9.0 and have not seen the 504 error so far. It looks better than the previous version. I will let you know if we hit the error again.
That's great news. As you know, we added a retry on 504. You can see it in the logs by setting TF_LOG to DEBUG or TRACE. I'm curious whether it always works on the first retry (which would indicate some sort of transient issue) or whether we need to retry several times.
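For anyone following along, the debug logging mentioned here uses Terraform's standard logging environment variables (nothing provider-specific is assumed):

```shell
# Terraform's standard logging env vars; DEBUG or TRACE makes the
# provider's 504 retry attempts visible in the log output.
export TF_LOG=DEBUG                      # or TRACE for maximum verbosity
export TF_LOG_PATH=./terraform-debug.log # capture logs to a file instead of stderr
# Then run the apply and search the log for the retried requests, e.g.:
#   terraform apply
#   grep -in '504\|retry' terraform-debug.log
```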
Hi @lonico
I'm Gabor with NetApp Tech Support, and I have been working with the customer on this issue.
@bryanheo As discussed, for me to investigate from the Cloud Manager end, we need logging verbosity enabled in Cloud Manager. This should let us see how long it takes CM to process the requests, and we can proactively enhance the software to work better with Terraform.
Once that is done, simply trigger a Cloud Manager AutoSupport and I will review it.
Hi @lonico I thought the issue had been resolved, but it has happened again. As mentioned above, the NetApp AWS resources were created successfully, but with the 504 error the Terraform state was not updated; in other words, we have to redeploy the cluster. Could you investigate?
Hello
We are deploying NetApp CVO in AWS through Terraform, and sometimes we get a 504 error during deployment, as shown below, even though the actual resources are successfully created in AWS. Because of the error, the TF state file is not updated and we have to redeploy (destroying the existing AWS resources via CloudFormation and redeploying through Terraform Enterprise). If we redeploy, it works fine. The error also sometimes occurs when we destroy TF resources. Is this a known issue, or is it something you can investigate?
504 error during the deployment
504 error during destroying TF resources
Regards Moon