We did a lot of this type of metadata collection in HubbleStack, which I worked on in my previous job. It can probably be improved (I didn't actually write this code) with environment variable checks and the like. Including here just for reference. https://github.com/hubblestack/hubble/blob/develop/hubblestack/extmods/grains/cloud_details.py
I'd be happy to lead the POC on this with the Python agent, once roadmap discussions have happened.
> It can probably be improved (I didn't actually write this code) with environment variable checks and the like.
It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.
Agreed. That's the primary way I would improve the linked code. :) It's just a useful reference for what kind of data is available and how to access each metadata endpoint for each cloud provider.
Following https://github.com/elastic/ecs/pull/816, all fields are now in ECS, though some are still to be included in a release.
I've taken the liberty of changing `cloud.availability.zone` to `cloud.availability_zone`. I assume that was a typo.
Intake for these will be available as of 7.8 and is now available in nightly snapshots.
@basepi a POC and guidelines for all agents to follow when implementing collection of this information would be great! I'll follow up on prioritization.
> It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.
After some initial investigation, I don't think there's a reliable way for us to detect the cloud provider without making network requests. If you go through the above linked thread for EC2, you'll find that there are endless edge cases (well-documented in this answer) and it seems that the most reliable way is indeed to hit the metadata server.
Unfortunately, if the metadata server isn't there, we have to wait for the timeout. On the bright side, we should only need to do this once, on startup. We can also be pretty aggressive with our timeouts, since the metadata server should be very low latency. (The cloud metadata timeout will also be configurable, in case we're too aggressive in some cases.)
My plan is to provide configuration to specify the cloud provider, as recommended by @axw above, including the ability to disable cloud metadata generation completely. I'll also implement some of the low-hanging fruit checks to reduce the blind checks as much as possible.
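For illustration, such a configuration option might look like the sketch below. The setting name `ELASTIC_APM_CLOUD_PROVIDER` and the accepted values are assumptions for this sketch, not necessarily the agent's actual configuration surface:

```python
import os

# Hypothetical setting name and values: "auto" probes each provider in
# turn, "none" disables cloud metadata collection entirely.
VALID_PROVIDERS = {"auto", "aws", "gcp", "azure", "none"}

def cloud_provider_setting():
    value = os.environ.get("ELASTIC_APM_CLOUD_PROVIDER", "auto").lower()
    if value not in VALID_PROVIDERS:
        raise ValueError(
            "cloud_provider must be one of %s, got %r"
            % (sorted(VALID_PROVIDERS), value)
        )
    return value
```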
My solution is code complete here. I still need to add some tests and do some manual end-to-end testing, but it's working on AWS, Azure, and GCP.
Notice that there is no provider guessing. I have found no consistent method to detect the provider outside of querying the metadata services.
In fact, even those AWS methods defined in the serverfault thread don't work. My AWS machine doesn't have `amazonaws.com` in `hostname -d`.
According to Amazon, the best way is to hit the metadata server. You can query the system's UUID, but it's not guaranteed to be accurate because there's nothing stopping non-EC2 servers from starting their UUID with `ec2`.
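For reference, that UUID heuristic looks roughly like this sketch (the file locations vary by virtualization type, which is part of why it's unreliable):

```python
def uuid_suggests_ec2():
    """Heuristic only: EC2 instance UUIDs start with "ec2", but nothing
    stops a non-EC2 machine from having such a UUID."""
    for path in ("/sys/hypervisor/uuid", "/sys/class/dmi/id/product_uuid"):
        try:
            with open(path) as f:
                if f.read().lower().startswith("ec2"):
                    return True
        except OSError:
            continue
    return False
```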
For Azure, we could check against Azure's IP blocks, but that's a changing list we don't want to have to maintain.
`dmidecode` is one place we could probably get the required information for Azure (and maybe for GCP, I haven't checked), but it requires `sudo`, which we won't (or, at least, shouldn't) have access to.
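For the record, that check would look something like this sketch; it needs root, which is exactly the problem:

```python
import subprocess

def manufacturer_via_dmidecode():
    """Needs root. On Azure VMs the system manufacturer is reportedly
    "Microsoft Corporation"; shown only to illustrate the rejected path."""
    try:
        out = subprocess.check_output(
            ["dmidecode", "-s", "system-manufacturer"],
            stderr=subprocess.DEVNULL,
            timeout=1,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return out.decode().strip()
```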
Luckily, all of the metadata services rely on non-routable addresses, which fail immediately in my testing. So trial and error should add effectively no overhead. (Even if it did, it only needs to happen once.)
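A minimal sketch of the trial-and-error probing, using each provider's documented metadata endpoint and a short timeout (error handling and response parsing are heavily simplified; the real implementation maps each provider's response into the ECS fields):

```python
import json
import urllib.request

# Documented metadata endpoints; all are link-local / non-routable, so a
# probe on the wrong cloud fails fast instead of hanging.
METADATA_PROBES = {
    "aws": ("http://169.254.169.254/latest/dynamic/instance-identity/document", {}),
    "gcp": ("http://metadata.google.internal/computeMetadata/v1/?recursive=true",
            {"Metadata-Flavor": "Google"}),
    "azure": ("http://169.254.169.254/metadata/instance?api-version=2019-08-15",
              {"Metadata": "true"}),
}

def detect_cloud_metadata(timeout=3.0):
    for provider, (url, headers) in METADATA_PROBES.items():
        try:
            request = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return provider, json.loads(response.read().decode())
        except Exception:
            continue  # not this provider (or no metadata service); try the next
    return None, None
```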
Would you gather the cloud metadata in a blocking or async way on startup? If async, what should we do when we want to send events to APM Server before the metadata is available? Some options:
In the Python agent, it's blocking when we're setting up the transport thread. In my testing, these metadata services are all local to the box and extremely fast; I expect the overhead to be effectively zero.
Doing it synchronously sounds like a good tradeoff then, given the complexity of doing it async. I expect the Node.js agent will only have the option of doing it async, though.
Perhaps. In that case, I would probably recommend the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option.
The Python implementation is complete and tested. We ended up doing this work in the transport background thread. There are no issues with race conditions: the send queue is ready before the thread starts, but the thread won't start processing the queue until metadata generation is complete. Effectively, this is the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option.
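A sketch of that ordering guarantee (class and method names are illustrative, not the agent's actual internals):

```python
import queue
import threading

def fetch_cloud_metadata():
    # Placeholder for the trial-and-error probing shown earlier.
    return {"provider": "unknown"}

class Transport:
    def __init__(self):
        # The send queue exists before the worker starts, so callers can
        # enqueue events immediately without racing the metadata fetch.
        self._events = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def queue_event(self, event):
        self._events.put(event)

    def _run(self):
        # Compute metadata first; only then start draining the queue, so
        # every intake request already carries the cloud metadata.
        metadata = fetch_cloud_metadata()
        while True:
            event = self._events.get()
            self._send(event, metadata)

    def _send(self, event, metadata):
        pass  # serialize and POST to the APM Server intake endpoint
```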
@elastic/apm-agent-devs The Python solution for this metadata is complete and can be used as a reference. Please open issues for this for each of your teams. I think we're targeting having this available (even if the UI isn't there yet) for 7.9.0.
elastic/kibana#70465 has been opened to collect some of these fields in the APM telemetry tasks.
@elastic/apm-agent-devs We found a bug in my implementation of AWS metadata collection. It turns out that if you PUT against the token endpoint from a Docker container in AWS Elastic Beanstalk, it fails with a ReadTimeout. So that token request needs exception handling and short timeouts (with no retries). See my fix here: https://github.com/elastic/apm-agent-python/pull/884
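A sketch of the guarded token request (the fall-back-to-no-token behavior is my reading of the fix; see the linked PR for the actual change):

```python
import urllib.request

def fetch_aws_metadata(timeout=1.0):
    headers = {}
    try:
        # IMDSv2: request a session token first. On some setups (e.g. Docker
        # containers on Elastic Beanstalk) this PUT can hang, so use a short
        # timeout, no retries, and swallow any failure.
        token_request = urllib.request.Request(
            "http://169.254.169.254/latest/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
            method="PUT",
        )
        with urllib.request.urlopen(token_request, timeout=timeout) as resp:
            headers["X-aws-ec2-metadata-token"] = resp.read().decode()
    except Exception:
        pass  # fall back to IMDSv1, i.e. no token header
    doc_request = urllib.request.Request(
        "http://169.254.169.254/latest/dynamic/instance-identity/document",
        headers=headers,
    )
    with urllib.request.urlopen(doc_request, timeout=timeout) as resp:
        return resp.read()
```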
Telemetry support in APM UI for sending AZ, provider, and region is shipping in 7.9 (elastic/kibana#71008) and will be available in the telemetry cluster mapping once elastic/telemetry#393 is complete.
The 7.9 milestone was met for the first implementation; all agents are expected to follow by 7.10 or sooner.
Description of the issue
There are multiple use cases where knowing the cloud provider would help our users and ourselves:
Agents would need to report the following optional fields (based on ECS):
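For context, a representative example of those cloud fields as a metadata payload might look like this (illustrative values only; see the ECS PR linked above for the authoritative field list):

```python
# Illustrative values; not every provider populates every field.
cloud_metadata = {
    "provider": "aws",                  # cloud.provider
    "region": "us-east-1",              # cloud.region
    "availability_zone": "us-east-1a",  # cloud.availability_zone
    "instance": {"id": "i-0123456789abcdef0", "name": "my-instance"},
    "machine": {"type": "t2.medium"},   # cloud.machine.type
    "account": {"id": "123456789012", "name": "my-account"},
    "project": {"id": "my-project", "name": "my-project"},  # e.g. on GCP
}
```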
APM Server can currently report this data for co-located applications (it needs to be installed on the same host). This is the code that collects cloud metadata: https://github.com/elastic/beats/tree/0ef472268ea41881f81accabe2af6cfb72eef682/libbeat/processors/add_cloud_metadata
Agent spec PR: https://github.com/elastic/apm/pull/290
Agent Issues