department-of-veterans-affairs / notification-api

Notification API
MIT License
16 stars 9 forks source link

Datadog StatsD #1308

Closed ldraney closed 1 year ago

ldraney commented 1 year ago

User Story - Business Need

As a member of the VA Notify team, I need to be able to see StatsD information in Datadog because it helps to monitor our application's performance and troubleshoot issues effectively.

User Story(ies)

As a Software Engineer
I want StatsD information to be properly integrated with Datadog
So that I can monitor and analyze key metrics, aiding in proactive troubleshooting and performance enhancements

Additional Info and Resources

The relevant StatsD configurations seem to be correctly set in the application's config.py. StatsD is enabled and should be publishing metrics to the port 8125. Additionally, the task definition for ECS Fargate correctly maps the container and host port to 8125. Despite these configurations, StatsD metrics are not visible in Datadog.

Code from app/config.py:

STATSD_HOST = os.getenv('STATSD_HOST')
STATSD_PORT = 8125
STATSD_ENABLED = bool(STATSD_HOST)

Code from ECS Fargate task definition:

"portMappings": [
  {
    "containerPort": 8125,
    "hostPort": 8125,
    "protocol": "udp"
  },

Engineering Checklist

Helpful tips

Acceptance Criteria

QA Considerations

For QA to populate. Leave blank if QA is not applicable on this ticket.

Potential Dependencies

Out of Scope

mjones-oddball commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @cris-oddball @EvanParish @justaskdavidb2 @k-macmillan @kalbfled @ldraney @nikolai-efimov

ldraney commented 1 year ago

Maybe reach out to Felipe (ask Kyle) about this exact issue

ldraney commented 1 year ago

datadog dockerfile:

FROM datadog/agent:latest

RUN \
  DEBIAN_FRONTEND=noninteractive apt-get update \
  && apt-get install -y --no-install-recommends -o Dpkg::Options::="--force-confnew" \
      ca-certificates jq \
    && apt-get clean

COPY ./certs/VA-Internal-S2-RCA1-v1.crt /usr/local/share/ca-certificates/VA-Internal-S2-RCA1-v1.crt
RUN /usr/sbin/update-ca-certificates

COPY ./scripts/datadog_expose_task_arn.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]

the script it calls:

#!/bin/bash

# a workaround for datadog agent under-reporting statsd metrics on ecs fargate
# taken from https://github.com/DataDog/datadog-agent/issues/3159#issuecomment-688978425

set -e
set -o pipefail

if [[ -n "${ECS_FARGATE}" ]]; then
  echo "datadog agent starting up in ecs!"
  echo "trying to get task_arn from metadata endpoint..."

  until [ -n "${task_arn}" ]; do
    task_arn=$(curl --silent 169.254.170.2/v2/metadata | jq --raw-output '.TaskARN | split("/") | last')
  done

  echo "got it. starting up with task_arn $task_arn"
  export DD_HOSTNAME=task-$task_arn

fi

/init
ldraney commented 1 year ago

for this checkbox "Verify StatsD configurations in the application and ECS Fargate task definition" I have verified that we have set up our application as Datadog documentation indicated. I've sent a somewhat lengthy message to Kyle to clarify what information we seem to be missing, because it seems to me we are getting statsd information.

The documentation says:

By default, DogStatsD listens on UDP port 8125, so you need to bind this port to your host port when running the Agent in a container. If your StatsD metrics come from outside of localhostyou must set DD_DOGSTATSD_NON_LOCAL_TRAFFIC to true

I've confirmed this and even played with the DD_DOGSTATSD_NON_LOCAL_TRAFFIC to see if that makes a difference, but I don't see a difference in the logs at this point. I'm deploying this variable to api and celery to double check.


note added 5:08 EST: Maybe I should try and delete the above script, maybe datadog no longer has the gap the previous team noticed, or maybe this is helping. I'm going to delete this script and see what happens.

ldraney commented 1 year ago

Something to note though that in all our environments (at first glance) we get this: image.png

These are the names of the tasks. Is statsd what is showing these? For example send-inbound-sms is in ./app/celery/service_callback_tasks.py:

@notify_celery.task(bind=True, name="send-inbound-sms", max_retries=5, default_retry_delay=300)
@statsd(namespace="tasks")
def send_inbound_sms_to_service(self, inbound_sms_id, service_id):
    service_callback = get_service_inbound_sms_callback_api_for_service(service_id=service_id)
    if not service_callback:
        current_app.logger.error(
            'could not send inbound sms to service "%s" because it does not have a callback API configured',
            service_id
        )
        return

Is this what Kyle means by the incomplete information? I'm not sure how to test or verify if changes I make are adding what we are looking for.

ldraney commented 1 year ago

Perhaps this is an example of missing information?

def send_callback_metrics(notification):
    statsd_client.incr(f"callback.sns.{notification.status}")
    if notification.sent_at:
        statsd_client.timing_with_dates("callback.sns.elapsed-time", datetime.utcnow(), notification.sent_at)

if this were to showup in datadog I would expect the metric to literally begin with callback.sns and be able to search and find that in the logs, but I cannot. This is just one example.

ldraney commented 1 year ago

Another example of incomplete information might be in app/va/vetext/client.py:

class VETextClient:
    STATSD_KEY = "clients.vetext"
    TIMEOUT = 3.05
...
        except requests.HTTPError as e:
            self.logger.exception(e)
            self.statsd.incr(f"{self.STATSD_KEY}.error.{e.response.status_code}")

This should send the exact string clients.vetext into our Datadog logs, but that string hasn't appeared for at least the last month.

ldraney commented 1 year ago

Maybe we are getting statsd in metrics rather than in the logs: image.png Which is exactly what is set in app/va/vetext/client.py:

class VETextClient:
    STATSD_KEY = "clients.vetext"
    TIMEOUT = 3.05

    def init_app(self, url: str, credentials: Credentials, logger: Logger, statsd: StatsdClient):
        self.base_url = url
        self.auth = HTTPBasicAuth(**credentials)
        self.logger = logger
        self.statsd = statsd
...
   try:
            start_time = monotonic()
            response = requests.post(
                f"{self.base_url}/mobile/push/send",
                auth=self.auth,
                json=payload,
                timeout=self.TIMEOUT
            )
            self.logger.info("VEText response: %s", response.json() if response.ok else response.status_code)
            response.raise_for_status()
        except requests.HTTPError as e:
            self.logger.exception(e)
            self.statsd.incr(f"{self.STATSD_KEY}.error.{e.response.status_code}")
            if e.response.status_code in [429, 500, 502, 503, 504]:
....
            self.statsd.timing(f"{self.STATSD_KEY}.request_time", elapsed_time)
ldraney commented 1 year ago

Another example, this time from queue_callback_strategy.py, is the callback.webhook.deliver_status.non_retryable_error, which does indeed show up in metrics one you specifically start typing it: image.png

So it seems to me that we've confirmed this information does indeed show up in Datadog! Just in a tab we were not familiar with. If that's true, next step (outside the scope of this ticket) would need to read through, or make a list of the statsd information we want, and then make custom graphs/dashboards.

ldraney commented 1 year ago

I had previously started a thread with the Datadog support team, and sent them the above notes to confirm my understanding.

ldraney commented 1 year ago

The first lines of the email to the Datadog team:

Statsd and a python application on Fargate: statsD information should just be showing up in our metrics explorer, right? Please confirm our configuration and that I'm viewing the information correctly:

I then proceeded to share examples from our code and what we are seeing in our dashboard.

He responded:

DogStatsD does have a stats mode in which you can see which metrics are the most processed. To enable this mode:

He then proceeded to explain how to set it up.

This feature is outside the scope of this ticket because he seems to be saying "yes you can see statsd information, but you can get additional details about which stats are coming in most often", which may not even be something we necessarily want.

Again, it seems to me that we've confirmed statsd is in Datadog! Any statsd information we need is likely already available, so understanding which data we want to create metrics and dashboards from is outside the scope of this ticket.

cris-oddball commented 1 year ago

@ldraney will we create a new ticket to determine what metrics and dashboards we need, or will we just ad hoc that until we realize we need something?

ldraney commented 1 year ago

@cris-oddball That's a question for the team, I think. Kyle has already expressed that he wants an alert on sms/email time, so I started a System Reliability Dashboard based on what I've learned during this statsd ticket. Perhaps a spike to list out what metrics are available from statsd, and then to determine which we want in a system reliability dashboard, and then a ticket to appropriately configuring the alerts, are appropriate next steps.

ldraney commented 1 year ago

I created #1342 and #1343, and requested for Kyle to take a look at them. He may ask for adjustment or changes, but I'll consider that a valid proposed solution, and will now close this ticket.