elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0

Agents should collect cloud identification metadata #256

Closed alex-fedotyev closed 4 years ago

alex-fedotyev commented 4 years ago

Description of the issue

There are multiple use cases where knowing the cloud provider would help our users and ourselves:

Agents would need to report the following optional fields (based on ECS):

APM server can currently report this data for co-located applications (it needs to be installed on the same host). This is code which collects cloud metadata: https://github.com/elastic/beats/tree/0ef472268ea41881f81accabe2af6cfb72eef682/libbeat/processors/add_cloud_metadata
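For reference, the ECS cloud fields in question have roughly the following shape. This is an illustrative sketch only: the field names follow ECS, but the values are invented examples, and the exact set each agent reports is defined in the spec PR linked below.

```python
# Illustrative example of ECS cloud fields an agent might report.
# Field names follow ECS; all values below are made up.
cloud_metadata = {
    "provider": "aws",
    "region": "us-east-1",
    "availability_zone": "us-east-1a",
    "instance": {"id": "i-0123456789abcdef0", "name": "my-instance"},
    "machine": {"type": "t2.medium"},
    "account": {"id": "123456789012"},
}
```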

Agent spec PR: https://github.com/elastic/apm/pull/290

Agent Issues

basepi commented 4 years ago

We did a lot of this type of metadata collection in HubbleStack, which I worked on in my previous job. It can probably be improved (I didn't actually write this code) with environment variable checks and the like. Including here just for reference. https://github.com/hubblestack/hubble/blob/develop/hubblestack/extmods/grains/cloud_details.py

basepi commented 4 years ago

I'd be happy to lead the POC on this with the Python agent, once roadmap discussions have happened.

axw commented 4 years ago

It can probably be improved (I didn't actually write this code) with environment variable checks and the like

It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.
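A minimal sketch of the kind of cheap pre-check described here, assuming hypothetical hint sources (environment variable names and the DMI vendor string readable from /sys/class/dmi/id/sys_vendor on Linux). These are assumptions for illustration, and as later comments in this thread conclude, none of them are fully reliable:

```python
import os

# Hypothetical pre-check: look for cheap local hints before falling back to
# probing metadata endpoints. The env var names and vendor strings below are
# assumptions and only cover some environments; "None" means unknown.
def guess_provider(environ=os.environ, dmi_vendor=""):
    if "AWS_EXECUTION_ENV" in environ or "ECS_CONTAINER_METADATA_URI" in environ:
        return "aws"
    if "WEBSITE_SITE_NAME" in environ:  # set on Azure App Service
        return "azure"
    if dmi_vendor.startswith("Amazon"):
        return "aws"
    if "Microsoft" in dmi_vendor:
        return "azure"
    if "Google" in dmi_vendor:
        return "gcp"
    return None  # unknown; fall back to querying metadata endpoints
```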

basepi commented 4 years ago

Agreed. That's the primary way I would improve the linked code. :) It's just a useful reference for what kind of data is available and how to access each metadata endpoint for each cloud provider.

graphaelli commented 4 years ago

Following https://github.com/elastic/ecs/pull/816 all fields are now in ECS, some still to be included in a release.

axw commented 4 years ago

I've taken the liberty of changing cloud.availability.zone to cloud.availability_zone. I assume that was a typo.

graphaelli commented 4 years ago

Intake for these fields will be available as of 7.8 and is already available in nightly snapshots.

@basepi a POC and guidelines for all agents to follow when implementing collection of this information would be great! I'll follow up on prioritization.

basepi commented 4 years ago

Tracking: https://github.com/elastic/apm-agent-python/issues/822

basepi commented 4 years ago

It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.

After some initial investigation, I don't think there's a reliable way for us to detect the cloud provider without making network requests. If you go through the thread linked above for EC2, you'll find that there are endless edge cases (well-documented in this answer), and it seems that the most reliable way is indeed to hit the metadata server.

Unfortunately, if the metadata server isn't there, we have to wait for the request to time out. On the bright side, we should only need to do this once, on startup. We can also be pretty aggressive with our timeouts, since the metadata server should be very low latency. (The cloud metadata timeout will also be configurable, in case we're too aggressive in some cases.)

My plan is to provide configuration to specify the cloud provider, as recommended by @axw above, including the ability to disable cloud metadata generation completely. I'll also implement some of the low-hanging fruit checks to reduce the blind checks as much as possible.
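The configuration behaviour described above could be sketched as follows. The option name "cloud_provider" and its allowed values are assumptions for illustration, not the agent's final API:

```python
# Sketch of provider-selection config: "auto" tries each provider in turn,
# "none" disables cloud metadata collection entirely, and a specific
# provider name skips the blind checks. Names/values are hypothetical.
VALID_PROVIDERS = {"auto", "aws", "gcp", "azure", "none"}

def providers_to_try(cloud_provider="auto"):
    if cloud_provider not in VALID_PROVIDERS:
        raise ValueError(f"invalid cloud_provider: {cloud_provider!r}")
    if cloud_provider == "none":
        return []  # cloud metadata collection disabled
    if cloud_provider == "auto":
        return ["aws", "gcp", "azure"]  # trial-and-error order
    return [cloud_provider]
```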

basepi commented 4 years ago

My solution is code complete here. I still need to add some tests and do some manual end-to-end testing, but it's working on AWS, Azure, and GCP.

Notice that there is no provider guessing. I have found no consistent method to detect provider outside of querying the metadata services.

In fact, even those AWS methods defined in the serverfault thread don't work. My AWS machine doesn't have amazonaws.com in hostname -d.

According to Amazon, the best way is to hit the metadata server. You can query the system's UUID but it's not guaranteed to be accurate because there's nothing stopping non-EC2 servers from starting their UUID with ec2.

For Azure, we could check against Azure's IP blocks, but that's a changing list we don't want to have to maintain.

dmidecode is one place we could probably get the required information for Azure (and maybe for GCP, I haven't checked), but it requires sudo which we won't (or, at least, shouldn't) have access to.

Luckily, all of the metadata services rely on non-routable addresses which fail immediately in my testing. So using trial and error should add effectively no overhead. (Even if it did, it only needs to happen once.)
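The trial-and-error approach can be sketched roughly as below. The GCP and Azure endpoints and headers are the providers' documented ones (the AWS flow additionally involves a token PUT and is omitted here); the error handling and timeout value are simplified assumptions, not the agent's actual code:

```python
import json
import urllib.request

# Off-cloud, connections to these link-local addresses fail almost instantly,
# so probing each provider in turn adds effectively no startup overhead.
TIMEOUT = 3.0  # seconds; an assumption, would be configurable in practice

def fetch_gcp_metadata(timeout=TIMEOUT):
    req = urllib.request.Request(
        "http://metadata.google.internal/computeMetadata/v1/?recursive=true",
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

def fetch_azure_metadata(timeout=TIMEOUT):
    req = urllib.request.Request(
        "http://169.254.169.254/metadata/instance?api-version=2020-06-01",
        headers={"Metadata": "true"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

def detect(fetchers):
    # Try each provider in turn; the first metadata endpoint that answers wins.
    for name, fetch in fetchers.items():
        try:
            return name, fetch()
        except Exception:
            continue
    return None, None
```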

felixbarny commented 4 years ago

Would you gather the cloud metadata in a blocking or async way on startup? If async, what should we do when we want to send events to APM Server before the metadata is available? Some options:

basepi commented 4 years ago

In the Python agent, it's blocking when we're setting up the transport thread. In my testing these metadata services are all local to the box and extremely fast, so I expect the overhead to be effectively zero.

felixbarny commented 4 years ago

Doing it synchronously sounds like a good tradeoff then, given the complexity of doing it async. I expect the Node.js agent will only have the option to do it async, though.

basepi commented 4 years ago

Perhaps. I would probably recommend the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option in that case.

basepi commented 4 years ago

The Python implementation is complete and tested. We ended up doing this work in the transport background thread. No issues with race conditions, as the send queue is ready before the thread starts, but the thread won't start processing the queue until after the metadata generation is complete. Effectively the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option.
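The ordering described above can be illustrated with a minimal sketch (class and method names are invented for illustration, not the agent's actual transport): events can be queued immediately, but the worker computes metadata before draining the queue.

```python
import queue
import threading

# Minimal sketch: the send queue exists before the worker thread starts, so
# callers can enqueue events immediately; the worker fetches cloud metadata
# first and only then begins processing, so every event is sent with metadata.
class Transport:
    def __init__(self, fetch_metadata):
        self._queue = queue.Queue()  # ready before the thread starts
        self.metadata = None
        self.sent = []
        self._thread = threading.Thread(
            target=self._worker, args=(fetch_metadata,), daemon=True
        )
        self._thread.start()

    def send(self, event):
        self._queue.put(event)  # safe even before metadata is ready

    def _worker(self, fetch_metadata):
        self.metadata = fetch_metadata()  # blocks only this thread
        while True:
            event = self._queue.get()
            if event is None:  # sentinel: shut down
                return
            self.sent.append((self.metadata, event))

    def close(self):
        self._queue.put(None)
        self._thread.join()
```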

basepi commented 4 years ago

@elastic/apm-agent-devs The Python solution for this metadata is complete and can be used as a reference. Please open issues for this for each of your teams. I think we're targeting having this available (even if the UI isn't there yet) for 7.9.0.

smith commented 4 years ago

elastic/kibana#70465 has been opened to collect some of these fields in the APM telemetry tasks.

basepi commented 4 years ago

@elastic/apm-agent-devs We found a bug in my implementation of AWS metadata collection. Turns out if you try to PUT against the token endpoint on a docker container in AWS Elastic Beanstalk, it fails with a ReadTimeout. So that token request needs to have exception handling, and short timeouts (with no retries). See my fix here: https://github.com/elastic/apm-agent-python/pull/884
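The shape of that fix can be sketched as below, assuming the documented IMDSv2 token endpoint. The timeout value and fallback behaviour are illustrative simplifications of the linked PR, not a copy of it:

```python
import urllib.request

# Sketch: request an IMDSv2 token with a short timeout and no retries. If the
# PUT fails (e.g. a ReadTimeout inside a Docker container on Elastic
# Beanstalk), fall back to querying metadata without a token (IMDSv1).
def get_aws_token(timeout=1.0):
    req = urllib.request.Request(
        "http://169.254.169.254/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except Exception:
        return None  # proceed without a token rather than crash

def metadata_headers(token):
    # Only attach the token header when we actually obtained a token.
    return {"X-aws-ec2-metadata-token": token} if token else {}
```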

smith commented 4 years ago

Telemetry support in APM UI for sending AZ, provider, and region is shipping in 7.9 (elastic/kibana#71008) and will be available in the telemetry cluster mapping once elastic/telemetry#393 is complete.

graphaelli commented 4 years ago

The 7.9 milestone has been met for the first implementation; all agents are expected to follow by 7.10 or sooner.