Unable to collect metrics for all VMs

rithinskaria commented 10 months ago

I have this deployed in my environment; however, I am not able to pull metrics from all VMs. I can see a couple of user VMs plus the HCX and NSX VMs. The environment has 20+ VMs and I am getting only 10. I tried giving the AVS credentials directly in telegraf.conf, but even then I can't collect the full metrics. I tested a different Telegraf configuration (using vm_metric_include) with InfluxDB, and I was able to collect all metrics.

Since I am using Azure Managed Grafana, I can't reach InfluxDB over its private IP, as Managed Grafana doesn't support a managed private endpoint for accessing VMs. I deployed another Grafana on-premises, and with the InfluxDB data source it works fine. From an observability standpoint, managing two Grafana instances doesn't make sense. If the metrics arrived in Azure Monitor, I could easily parse and transform them rather than writing complex InfluxDB queries.

In my current configuration, I hardcoded the AVS resource ID, region, and credentials. Any pointers? When I run Telegraf, the log shows "Found 11 metrics for vm-01", yet vm-01 never reaches the CSV.
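
To narrow down which VMs Telegraf discovers but which never arrive downstream, the log can be compared against the names that did show up. The following is only a minimal sketch: the regular expression assumes the log message looks like the one quoted above, and the function names and sample data are hypothetical.

```python
import re

# Assumed log format, based on the message quoted above
# ("Found 11 metrics for vm-01"); adjust the pattern to the real log.
# Note: \S+ will not capture VM names that contain spaces.
FOUND_RE = re.compile(r"Found (\d+) metrics for (\S+)")


def vms_seen_in_log(log_text):
    """Return {vm_name: highest metric count logged for that VM}."""
    seen = {}
    for match in FOUND_RE.finditer(log_text):
        count, name = int(match.group(1)), match.group(2)
        seen[name] = max(count, seen.get(name, 0))
    return seen


def missing_downstream(log_text, downstream_names):
    """VMs that appear in the log but not in the downstream name list."""
    return sorted(set(vms_seen_in_log(log_text)) - set(downstream_names))


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_log = "Found 11 metrics for vm-01\nFound 9 metrics for vm-02\n"
    print(missing_downstream(sample_log, ["vm-02"]))  # prints ['vm-01']
```

Here `downstream_names` would be whatever list of VM names you can export from the destination you are checking.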

Any idea how I can fix this?

adeturner commented 9 months ago

Significant caveats: I'm just a passer-by, not related to MS or this repo, and I'm new to Telegraf. You may already know what I'm saying, but either way it was good learning for me :)

So when you say "can see 10 VMs only", it's not clear which ones they are.

Please confirm that you are using the Worker files from the repo unchanged.
Q. Have you installed the Worker node on an Azure VM and not a VMware VM?
I suggest you post telegraf.conf in a gist.
Post the Telegraf log file in a second gist.
A diagram showing your network topology would help, color-coding the VMs that are discovered and those that are not.

Metrics are collected in two ways:

NSX using the python service
VMs using the telegraf plugin for vsphere

For NSX:

main.py collects the VM info here. You can add debug output there to show which nodes are and are not being identified, and report back.

For vsphere:

vSphere monitoring is configured here, and the plugin reference is here.

I do note that the syntax in the Azure file differs from the readme ([""] vs ["/host/**"]). You could specify the config per the plugin readme and see if it makes a difference.
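
To spot that kind of difference quickly, a small script can list the include patterns in a telegraf.conf and flag any that are empty. This is a rough sketch, not a TOML parser: it only handles single-line lists, the function names are made up, and which option carries which value in the sample is hypothetical. Whether an empty pattern is actually the cause here is not confirmed.

```python
import re

# Matches single-line settings such as:  vm_include = ["/*/vm/**"]
# Commented-out lines (starting with '#') are not matched.
# Any option ending in "_include" is picked up, including metric filters.
INCLUDE_RE = re.compile(r'^\s*(\w+_include)\s*=\s*\[(.*?)\]', re.MULTILINE)


def include_settings(conf_text):
    """Return {option_name: [patterns]} for every *_include line found."""
    settings = {}
    for name, body in INCLUDE_RE.findall(conf_text):
        settings[name] = re.findall(r'"([^"]*)"', body)
    return settings


def suspicious_includes(conf_text):
    """Option names whose list is empty or contains an empty string."""
    return sorted(
        name
        for name, patterns in include_settings(conf_text).items()
        if not patterns or "" in patterns
    )


if __name__ == "__main__":
    # Hypothetical fragment using the two values mentioned above.
    sample = '''
[[inputs.vsphere]]
  vm_include = [""]
  host_include = ["/host/**"]
'''
    print(suspicious_includes(sample))  # prints ['vm_include']
```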

Hope that helps

khensler commented 9 months ago

I'm a little unclear on what you are trying to do. The CSV is only used for NSX metrics. All of the vSphere metrics are read directly by Telegraf via the vsphere plugin and then forwarded to Azure Monitor via the output plugin. The NSX metrics are written to CSV, then read by the input plugin and sent to Azure Monitor via the output plugin. None of the NSX metrics include any VM metrics, only Edge metrics.

rithinskaria commented 8 months ago

@khensler But I do see some VM names in the CSV besides the HCX and NSX VM names, so I was under the impression that all the VM metrics, including the VM names, would be written to the CSV.

Anyway, I still can't find all VMs in Azure Monitor.

rithinskaria commented 8 months ago

@adeturner Thanks. I am also new to Telegraf and am exploring options to get metrics into Azure Managed Grafana. This is running on an Azure VM that has access to the AVS environment, and the traffic is allowed through the firewall.

I will post the architecture and config.

Azure / azure-vmware-solution

Unable to collect metrics for all VMs #15