Missing step in AddonTerraformTemplate

iohenkies commented 2 months ago

Hi all,

For the below we already had a support ticket at Microsoft. Their conclusion was that there are no issue with the Azure backplane regarding this specific issue, and it should be a problem in the code. They reproduced our situation, and had the exact same results.

The issue We're setting up managed Prometheus on some of our private AKS clusters, using the code provided here: https://github.com/Azure/prometheus-collector/tree/main/AddonTerraformTemplate

It seems to be missing one essential step, or in any case it doesn't work: the data collection endpoint and data collection rule aren't properly linked (I believe this is called the data collection rule association). See screenshot.

When this is not set properly, managed Prometheus is not working, and the ama-metrics pods on the cluster keep crashing.

We've tested this in different subscriptions with different clusters, with different versions of the Azure TerrAform providers even, the results are always the same. So there seems to be a step missing, in the code, Terraform provider, a combination of those, or something else. As said, the Microsoft support team had the same results with the same code.

When we manually select and save the proper endpoint, managed Prometheus starts working and pods stop crashing. We need to do this via the Terraform code though, and not manually.

Any help and input is greatly appreciated.

Thanks and regards! comm_issue

Sohamdg081992 commented 2 months ago

@iohenkies , just to confirm , the clusters are private linked clusters, right?

iohenkies commented 2 months ago

@Sohamdg081992 that's correct and probably good to add to the original issue. I'll add it.

Sohamdg081992 commented 2 months ago

@iohenkies Thanks! We are working on updating our docs to point to the private link setup after onboarding addon.

iohenkies commented 2 months ago

@Sohamdg081992 thanks. Does this mean that it is not possible to setup via code (in the same Terraform run), and that you always need to do the manual step afterwards, or am I misunderstanding?

iohenkies commented 2 months ago

Hi @Sohamdg081992, could you please clarify the above?

Sohamdg081992 commented 2 months ago

@iohenkies we have added a task in our sprint to add the Terraform setup code for private linked clusters and will work on it.

iohenkies commented 2 months ago

Thanks a lot, this was unclear to me. Will await an update.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

iohenkies commented 2 months ago

Still relevant. Devs are working on it or will be working on it.

Sohamdg081992 commented 2 months ago

@iohenkies I just created a PR to add the step. The PR description has all the screenshots of the linking. Please feel free to test with this update and let me know if this works for you.

iohenkies commented 2 months ago

Great @Sohamdg081992 ! Will test this week.

Sohamdg081992 commented 1 month ago

@iohenkies Thanks! Please feel free to test with the latest update in the PR. I made a commit today to create new DCE, if the cluster and workspace regions do not match.

Sohamdg081992 commented 1 month ago

@iohenkies did you get a chance to test the template?

iohenkies commented 1 month ago

My sincere apologies @Sohamdg081992, there have been some production outages last week. I will test it this week for sure.

Sohamdg081992 commented 1 month ago

Thanks @iohenkies !

iohenkies commented 1 month ago

Hi @Sohamdg081992, sorry for the delay.

With part of your code I was able to finally fix this for existing deployments. New deployments don't work yet. Let me explain a bit more about our environment.

For Terraform we're using a multi-level setup as explained below: https://dev.to/piotrgwiazda/main-module-approach-for-handling-multiple-environments-in-terraform-1oln

This separates environments and creates separate statefiles per environment.

To visualize, this is our directory structure: tree

Also important to know is that we can only deploy in a single region (company policy).

This being said, you can see that we can't use your code as provided. We need to move code around, make modules scalable, variablize all parts in the module and move them to the main or env directory, etc.

I'm providing you with this background, because the code as-is didn't work for us, but that might be because of the way we structured it. Finally the trick for existing deployment for us was to add this code in our module, which you provided:

resource "azurerm_monitor_data_collection_endpoint" "dce_privatelink" {
  kind                          = var.dcedcr_kind
  location                      = var.region
  name                          = var.msprom_privatelink_name
  resource_group_name           = var.resource_group_name
  public_network_access_enabled = var.public_network_access_enabled
  tags                          = local.common_tags
}

resource "azurerm_monitor_data_collection_rule_association" "dcra_privatelink" {
  data_collection_endpoint_id = azurerm_monitor_data_collection_endpoint.dce_privatelink.id
  target_resource_id          = var.cluster_id
  depends_on = [
    azurerm_monitor_data_collection_endpoint.dce,
    azurerm_monitor_data_collection_rule.dcr,
  ]
}

All variables come from main or env.

As you can see, we removed the counts and simplified the data_collection_endpoint_id. Adding the count provided us with a couple of errors, such as that count.index has to be provided to the dcra_privatelink and finally I could not come up with a reason why it should be provided at all. But maybe I'm wrong?

Also, the useAzureMonitorPrivateLinkScope should be declared somewhere, which now didn't happen in the code. But once again, this could just as easily be caused by our code structure.

Anyway, this works perfectly for the existing environments. It associates the dcr with the dce, metrics start flowing, pods stop crashing. Unfortunately though, a brand new environment does not result in a working Managed prometheus via private link. Tomorrow I can do more troubleshooting.

iohenkies commented 1 month ago

Time was limited today and unfortunately I did not book any significant progress.

So the issue remains as it was yesterday: existing deployments can be altered with this code to create a working managed Prometheus via private endpoints. A complete new environment with this code results in no metrics (e.g. 'The query returned no results.' in the Prometheus explorer) and the crashing pods.

Could you expand a bit more on the options I took out now because Terraform was not able to apply with these options? So the counts in the different blocks and the useAzureMonitorPrivateLinkScope variable? For instance, what should be the default here?

Sohamdg081992 commented 1 month ago

@iohenkies I am unable to repro the issue for new setup. It is probably due to the multi-level setup. I was able to create a new deployment (cluster, DC objects, etc) with the code in the PR and have attached the screenshots for the new cluster creation. Can you please give more details on the error you are seeing for new environments.

The first resource is azurerm_monitor_data_collection_endpoint named private_link_dce. This resource creates a data collection endpoint for Azure Monitor. The count attribute is used to conditionally create this resource based on two conditions: whether the: The useAzureMonitorPrivateLinkScope variable is true and whether the location of the Kubernetes cluster is different from the location of the Azure Monitor workspace. If both conditions are met, the resource is created; otherwise, it is not.

The second resource is azurerm_monitor_data_collection_rule_association named dcraprivatelink. This resource associates a data collection rule with a target resource, in this case, the Kubernetes cluster. The count attribute conditionally creates this resource based on the useAzureMonitorPrivateLinkScope variable. The data_collection_endpoint_id attribute sets the ID of the data collection endpoint to be used. If the Kubernetes cluster's location matches the Azure Monitor workspace's location, it uses the default endpoint (azurerm_monitor_data_collection_endpoint.dce.id); otherwise, it uses the private link endpoint (azurerm_monitor_data_collection_endpoint.private_link_dce.id). The target_resource_id attribute specifies the ID of the Kubernetes cluster. The depends_on attribute ensures that this resource is created only after the specified dependencies are created.

iohenkies commented 1 month ago

Thanks @Sohamdg081992. I will start working on this next week on our next sprint and report back ASAP. Is there a possibility to contact you more directly, e.g. via Email, Slack or Discord? Only for this issue of course. If not, we'll just use this issue.

Sohamdg081992 commented 1 month ago

@iohenkies yes please contact me directly over email sohdasgupta@microsoft.com.

iohenkies commented 1 month ago

Much appreciated. Will work on this the coming week.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been stalled for 12 days with no activity.

iohenkies commented 3 weeks ago

This is not entirely resolved IMHO. There are some follow-up questions that I emailed @Sohamdg081992 about on 29-08 and 04-09. It would be nice if we could tackle those.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

iohenkies commented 2 weeks ago

Unfortunately still some unanswered questions.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

iohenkies commented 1 week ago

Still relevant. Emailed again 27-09.

bragi92 commented 1 day ago

@iohenkies Sorry for the delay. I'm taking over this issue but unfortunately I do not have the questions that you had sent to Soham. Can you please send the email to c i p r o m e t h e u s [ a t ] m i c r o s o f t [ d o t ] c o m?

Azure / prometheus-collector

Missing step in AddonTerraformTemplate #942