Closed ldraney closed 1 year ago
Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @EvanParish @justaskdavidb2 @k-macmillan @kalbfled @ldraney @nikolai-efimov
Maybe reach out to Felipe (ask Kyle) about this exact issue
Datadog Dockerfile:

```dockerfile
FROM datadog/agent:latest

RUN \
    DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get install -y --no-install-recommends -o Dpkg::Options::="--force-confnew" \
        ca-certificates jq \
    && apt-get clean

COPY ./certs/VA-Internal-S2-RCA1-v1.crt /usr/local/share/ca-certificates/VA-Internal-S2-RCA1-v1.crt
RUN /usr/sbin/update-ca-certificates

COPY ./scripts/datadog_expose_task_arn.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
the script it calls:
```bash
#!/bin/bash
# a workaround for datadog agent under-reporting statsd metrics on ecs fargate
# taken from https://github.com/DataDog/datadog-agent/issues/3159#issuecomment-688978425
set -e
set -o pipefail

if [[ -n "${ECS_FARGATE}" ]]; then
    echo "datadog agent starting up in ecs!"
    echo "trying to get task_arn from metadata endpoint..."
    until [ -n "${task_arn}" ]; do
        task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
    done
    echo "got it. starting up with task_arn $task_arn"
    export DD_HOSTNAME=task-$task_arn
fi

/init
```
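The `jq` extraction in the script keeps only the final path segment of the TaskARN and prefixes it with `task-` for `DD_HOSTNAME`. A minimal Python sketch of the same transformation (the ARN below is made up; the real value comes from the ECS v2 metadata endpoint at `169.254.170.2/v2/metadata`):

```python
# Made-up example of the TaskARN field returned by the ECS v2 metadata
# endpoint; the real value is assigned by AWS at task launch.
metadata = {
    "TaskARN": "arn:aws:ecs:us-east-1:123456789012:task/notify-cluster/8f03e41243824aea923aca126495f665"
}

# Equivalent of jq's '.TaskARN | split("/") | last': keep only the task ID.
task_id = metadata["TaskARN"].split("/")[-1]

# What the script exports as DD_HOSTNAME.
hostname = f"task-{task_id}"
print(hostname)  # → task-8f03e41243824aea923aca126495f665
```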
For the checkbox "Verify StatsD configurations in the application and ECS Fargate task definition": I have verified that our application is set up as the Datadog documentation indicates. I've sent a somewhat lengthy message to Kyle to clarify what information we seem to be missing, because it seems to me we are getting statsd information.
The documentation says:

> By default, DogStatsD listens on UDP port 8125, so you need to bind this port to your host port when running the Agent in a container. If your StatsD metrics come from outside of localhost, you must set DD_DOGSTATSD_NON_LOCAL_TRAFFIC to true.
I've confirmed this and even toggled `DD_DOGSTATSD_NON_LOCAL_TRAFFIC` to see whether it makes a difference, but I don't see any change in the logs at this point. I'm deploying this variable to api and celery to double check.
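For reference, DogStatsD metrics travel as small plain-text UDP datagrams, which is why the port binding matters. A minimal sketch of what a counter increment looks like on the wire (the metric name and agent host below are made up):

```python
import socket

def dogstatsd_count(metric: str, value: int = 1) -> bytes:
    """Format a counter increment in the statsd wire format: '<name>:<value>|c'."""
    return f"{metric}:{value}|c".encode()

payload = dogstatsd_count("example.metric")

# Delivery is a single UDP datagram to the agent's port 8125 (commented out
# here because there is no agent listening in this sketch):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sock.sendto(payload, ("agent-host", 8125))
print(payload)  # → b'example.metric:1|c'
```

Because UDP is fire-and-forget, a misrouted port produces no client-side error, only silently missing metrics.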
Note added 5:08 EST: Maybe I should try deleting the above script; perhaps Datadog no longer has the gap the previous team noticed, or perhaps the script is what's helping. I'm going to delete it and see what happens.
Something to note, though: in all our environments (at first glance) we get this:
These are the names of the tasks. Is statsd what is surfacing these? For example, send-inbound-sms is defined in ./app/celery/service_callback_tasks.py:
```python
@notify_celery.task(bind=True, name="send-inbound-sms", max_retries=5, default_retry_delay=300)
@statsd(namespace="tasks")
def send_inbound_sms_to_service(self, inbound_sms_id, service_id):
    service_callback = get_service_inbound_sms_callback_api_for_service(service_id=service_id)

    if not service_callback:
        current_app.logger.error(
            'could not send inbound sms to service "%s" because it does not have a callback API configured',
            service_id
        )
        return
```
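I haven't dug into the `statsd` decorator's internals, but a decorator like `@statsd(namespace="tasks")` plausibly times the call and emits a namespaced metric. A hypothetical sketch of that pattern (the decorator signature, the stand-in client, and the metric naming are all assumptions, not the actual implementation):

```python
import functools
from time import monotonic

def statsd_sketch(namespace, client, name):
    """Hypothetical sketch: time the wrapped function, emit '<namespace>.<name>'."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                client.timing(f"{namespace}.{name}", monotonic() - start)
        return wrapper
    return decorator

class RecordingClient:
    """Stand-in statsd client that records the metric names it would send."""
    def __init__(self):
        self.sent = []

    def timing(self, metric, value):
        self.sent.append(metric)

client = RecordingClient()

@statsd_sketch("tasks", client, "send-inbound-sms")
def task():
    return "done"

task()
print(client.sent)  # → ['tasks.send-inbound-sms']
```

If the real decorator works anything like this, the task names we see would arrive as `tasks.*` metrics, which would explain them appearing in Metrics rather than Logs.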
Is this what Kyle means by the incomplete information? I'm not sure how to test or verify whether changes I make add what we are looking for.
Perhaps this is an example of missing information?
```python
def send_callback_metrics(notification):
    statsd_client.incr(f"callback.sns.{notification.status}")
    if notification.sent_at:
        statsd_client.timing_with_dates("callback.sns.elapsed-time", datetime.utcnow(), notification.sent_at)
```
If this were to show up in Datadog, I would expect the metric name to literally begin with callback.sns, and I would expect to be able to search for and find that string in the logs, but I cannot. This is just one example.
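To make the expectation concrete, here is what that incr/timing pair would emit (the status values are made up for illustration):

```python
from datetime import datetime, timedelta

# Made-up statuses; the incr() call above would produce metric names like:
names = [f"callback.sns.{status}" for status in ("delivered", "failed")]
print(names)  # → ['callback.sns.delivered', 'callback.sns.failed']

# timing_with_dates presumably reports the gap between two datetimes;
# the elapsed time itself is just a subtraction:
sent_at = datetime(2023, 1, 1, 12, 0, 0)
now = sent_at + timedelta(seconds=2.5)
elapsed_ms = (now - sent_at).total_seconds() * 1000
print(elapsed_ms)  # → 2500.0
```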
Another example of incomplete information might be in app/va/vetext/client.py:
```python
class VETextClient:
    STATSD_KEY = "clients.vetext"
    TIMEOUT = 3.05
    ...
        except requests.HTTPError as e:
            self.logger.exception(e)
            self.statsd.incr(f"{self.STATSD_KEY}.error.{e.response.status_code}")
```
This should send the exact string clients.vetext into our Datadog logs, but that string hasn't appeared there for at least the last month.
Maybe we are getting statsd data in Metrics rather than in Logs, which is exactly what is set up in app/va/vetext/client.py:
```python
class VETextClient:
    STATSD_KEY = "clients.vetext"
    TIMEOUT = 3.05

    def init_app(self, url: str, credentials: Credentials, logger: Logger, statsd: StatsdClient):
        self.base_url = url
        self.auth = HTTPBasicAuth(**credentials)
        self.logger = logger
        self.statsd = statsd

    ...
        try:
            start_time = monotonic()
            response = requests.post(
                f"{self.base_url}/mobile/push/send",
                auth=self.auth,
                json=payload,
                timeout=self.TIMEOUT
            )
            self.logger.info("VEText response: %s", response.json() if response.ok else response.status_code)
            response.raise_for_status()
        except requests.HTTPError as e:
            self.logger.exception(e)
            self.statsd.incr(f"{self.STATSD_KEY}.error.{e.response.status_code}")
            if e.response.status_code in [429, 500, 502, 503, 504]:
                ....
        self.statsd.timing(f"{self.STATSD_KEY}.request_time", elapsed_time)
```
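The request-timing pattern in that client reduces to a monotonic-clock delta around the HTTP call. A standalone sketch (the sleep stands in for the `requests.post()` round trip):

```python
from time import monotonic, sleep

start_time = monotonic()
sleep(0.05)  # stand-in for the requests.post() round trip
elapsed_time = monotonic() - start_time

# This delta is what would be reported via
# statsd.timing(f"{STATSD_KEY}.request_time", elapsed_time).
print(f"clients.vetext.request_time: {elapsed_time:.3f}s")
```

`monotonic()` is used instead of wall-clock time so the measurement can't be skewed by clock adjustments mid-request.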
Another example, this time from queue_callback_strategy.py, is callback.webhook.deliver_status.non_retryable_error, which does indeed show up in Metrics once you specifically start typing it:
So it seems to me that we've confirmed this information does indeed show up in Datadog, just in a tab we were not familiar with! If that's true, the next step (outside the scope of this ticket) would be to read through, or make a list of, the statsd information we want, and then build custom graphs/dashboards from it.
I had previously started a thread with the Datadog support team, and sent them the above notes to confirm my understanding.
The first lines of the email to the Datadog team:

> Statsd and a python application on Fargate: statsD information should just be showing up in our metrics explorer, right? Please confirm our configuration and that I'm viewing the information correctly:
I then proceeded to share examples from our code and what we are seeing in our dashboard.
He responded:

> DogStatsD does have a stats mode in which you can see which metrics are the most processed. To enable this mode:
He then proceeded to explain how to set it up.
This feature is outside the scope of this ticket: he seems to be saying "yes, you can see statsd information, and you can additionally get details about which stats come in most often," which may not even be something we want.
Again, it seems to me that we've confirmed statsd is in Datadog! Any statsd information we need is likely already available, so understanding which data we want to create metrics and dashboards from is outside the scope of this ticket.
@ldraney will we create a new ticket to determine what metrics and dashboards we need, or will we just ad hoc that until we realize we need something?
@cris-oddball That's a question for the team, I think. Kyle has already expressed that he wants an alert on sms/email time, so I started a System Reliability Dashboard based on what I've learned during this statsd ticket. Perhaps the appropriate next steps are: a spike to list out what metrics are available from statsd, then a decision on which we want in a system reliability dashboard, and then a ticket to appropriately configure the alerts.
User Story - Business Need
As a member of the VA Notify team, I need to be able to see StatsD information in Datadog because it helps to monitor our application's performance and troubleshoot issues effectively.
User Story(ies)
As a Software Engineer
I want StatsD information to be properly integrated with Datadog
So that I can monitor and analyze key metrics, aiding in proactive troubleshooting and performance enhancements
Additional Info and Resources
The relevant StatsD configurations seem to be correctly set in the application's config.py. StatsD is enabled and should be publishing metrics to port 8125. Additionally, the task definition for ECS Fargate correctly maps the container and host port to 8125. Despite these configurations, StatsD metrics are not visible in Datadog.

Code from app/config.py:

Code from ECS Fargate task definition:
Engineering Checklist
Helpful tips
Acceptance Criteria
QA Considerations
For QA to populate. Leave blank if QA is not applicable on this ticket.
Potential Dependencies
Out of Scope