department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Monitoring: Datadog for VFS Teams #33081

Open jhouse-solvd opened 2 years ago

jhouse-solvd commented 2 years ago

Problem Statement

The platform's legacy monitoring solution lacks features desired by developers. Getting access to the platform's monitoring tools is a cumbersome process. The approval process can involve long delays. It's difficult to know how an app is performing without these tools.

Background

Resources

Platform Website: Platform tools page Overview of Datadog capabilities


How might we?

...make it easier for VFS Teams to use the platform's monitoring tools? ...provide better tools for monitoring, alerting, and visualization of VFS products? ...reduce the administrative overhead involved in providing access to tools for VFS Teams?

Hypothesis or Bet

We will know we're done when... ("Definition of Done")

Known Blockers/Dependencies

List any blockers or dependencies for this work to be completed

Projected Launch Date

TBD

Launch Checklist

Is this service / tool / feature...

... tested?

... documented?

... measurable?

When you're ready to launch...

Required Artifacts

Documentation

Testing

Measurement

mchelen-gov commented 2 years ago

Additional notes:

Background

Hypothesis or bet

AC

tpharrison commented 2 years ago

Hi @jhouse-solvd. Here is the ticket outlining our request for Datadog access. Let me know if you need anything else. Thanks!

ElijahLynn commented 2 years ago

Per discussion with @jhouse-solvd here, posting our use case.

The VFS-CMS team needs read-only access to the CMS Dashboards here > https://vagov.ddog-gov.com/dashboard/lists?q=cms, so they have access similar to what they had with Grafana.

jhouse-solvd commented 1 year ago

Follow-up to @ElijahLynn's comment above.

It might be good to evaluate if VFS teams can get access to Datadog without the Platform Infrastructure team being a blocker. @mchelen-gov may have guidance for use cases where this would make sense.

jhouse-solvd commented 1 year ago

Discovery around VFS use cases should be undertaken to understand which features and data they need access to. It would be good to engage Platform research experts to help gather that information.

tpharrison commented 1 year ago

Hi @jhouse-solvd. In May/June, my team (auth exp) started using Datadog to monitor activity related to our vets-api endpoints and alert us of any unusual activity.

In July, we lost access when the switch was made to the GovCloud account. Our team has left this ticket open which outlines our request for Datadog access.

Just following up to find out if/when we can get access. Thanks!

jhouse-solvd commented 1 year ago

@tpharrison - Hi Tom! Thanks for the nudge and your patience. We're looking into this and should have more info to share soon. We want to understand how teams use Datadog to ensure we have it set up correctly.

tpharrison commented 1 year ago

@jhouse-solvd Thanks for the quick response! Any idea on how long the review will take? Just looking for a rough timeframe to help with team planning.

jhouse-solvd commented 1 year ago

UPDATE 11/7/22

We aim to provide an excellent experience for teams using the VA.gov Platform's monitoring tools. To do this, we're working w/ the Platform Service Design team to understand the monitoring needs and use cases of VFS developers.

Together, we'll undertake discovery research that will provide valuable insights into VFS needs. From there, we'll look to selectively onboard developers to Datadog and collect their feedback.

For VFS developers wanting to use Datadog (incl. those who previously requested access), stay tuned! We'll follow up soon.

Note: We're not yet sure about the timeline but should know more in the coming weeks.

cc: @mchelen-gov @jwoodman5

TheBoatyMcBoatFace commented 1 year ago

Hi @jhouse-solvd -

I'm the new PM for Platform CMS DevOps and looking for access to Datadog. I'd like insight into my team's infrastructure and a sandbox board to build out/finetune some visualizations.

I've worked with Domo, Sentry, Elastic, & Grafana, but never Datadog. I'd be glad to provide feedback/beta test features/functionality if needed.

jhouse-solvd commented 1 year ago

UPDATE 11/22/22

We want to ensure a streamlined onboarding experience and that teams have access to the right features and metrics.

We know there’s a lot of interest in this tool, and we thank you for your patience. ☺️

cferris32 commented 1 year ago

Hi there, I'm a vets-api developer for the VAOS team and we leverage Datadog as one of our health monitoring tools. Currently I have read-only access but would like to upgrade my account so I can enhance our monitoring as needed, especially since we have a big release coming soon. Let me know as soon as this would be feasible and I'm happy to provide credentials and take any extra steps needed on my end. Thanks so much!

ksk385 commented 1 year ago

Hi, I am from the Chatbot team and we would like to test out Datadog for our monitoring/alerting needs. How can we get accounts for my team to try it out? Thanks!

TheBoatyMcBoatFace commented 1 year ago

Hi @jhouse-solvd -

Following up on my Datadog request - lack of access has become a blocker for some of my projects, so I'm trying, if possible, to get this resolved sooner rather than later.

If there is anything I can do to help the process along, please let me know.

jhouse-solvd commented 1 year ago

@TheBoatyMcBoatFace - Thanks for the nudge. We're looking to provide read access soon, but it sounds like you might also need write access. Is that correct?

TheBoatyMcBoatFace commented 1 year ago

Thanks @jhouse-solvd. Yes, I'm taking point on the dashboards, KPI, and alerts for my team. I'm glad to walk you through my use case if it would be helpful.

oseasmoran73 commented 1 year ago

image

FYI, you can set perms boundary. It is not a single team, but rather a list

jhouse-solvd commented 1 year ago

@tpharrison @cferris32 @ksk385 @cohnjesse @vhenry07 @kjduensing

You previously requested or expressed interest in access to Datadog. If you have a few minutes, we'd love to get your input on this survey: https://ows.io/qs/ns7w5ro0

We want to understand various use cases to ensure that tools, documentation, and resources are available to support your needs properly. Your input will help us prepare for release. ☺️

chrisj-usds commented 1 year ago

Tagging @batemapf and @klawyer who are interested in getting onboard with DD asap

batemapf commented 1 year ago

thanks @chrisj-usds. filled out survey.

jhouse-solvd commented 1 year ago

A question came up in Slack:

Any idea if the platform team will be the ones making requests to the DOTS team for access on behalf of VFS teams, or if VFS teams should just go ahead and request access directly from the DOTS team? or does the platform team have direct administrative control over vagov.ddog-gov.com?

We're considering options and hope to minimize onboarding friction for VFS teams.

Background

Options

  1. VFS Teams request access directly from DOTS
  2. Platform requests access from DOTS on behalf of VFS Teams
  3. Other, TBD

What do you think would work best?

batemapf commented 1 year ago

how about:

  1. ask DOTS to grant all platform users who currently have access to grafana access to datadog, with read only perms
  2. figure out some criteria for granting write perms
  3. use that criteria to change perms of previously created read only accounts to read and write, since this can be done without involvement of DOTS

then you’ll need to incorporate a version of this into the onboarding process, I'm guessing



jhouse-solvd commented 1 year ago

@batemapf - This is a great idea.

Regarding number 1, (I believe) the only challenge to that workflow would be setting up users in Okta:

Currently, users have to log in to Datadog via Okta. When we spoke w/ DOTS previously, we learned that a user account has to be provisioned in Okta, and each user has 24 hours to respond to the invite. Once the user has accepted the Okta invite, DOTS can add Datadog as an application to be launched from Okta.

Regarding numbers 2 and 3, I think that's spot on. We're hoping to define those criteria soon, and we're looking to VFS teams to help us understand which features and resources they need access to within Datadog.

tpharrison commented 1 year ago

Hi Jesse

On 1/4, I was asked to fill out this survey (https://ows.io/qs/ns7w5ro0) but I think I filled it out a couple of weeks ago. I was wondering if you can confirm.

Thanks.

jhouse-solvd commented 1 year ago

@tpharrison - Confirmed! Thank you so much for sharing your input. It's genuinely appreciated.

We'll review soon and follow up!

cferris32 commented 1 year ago

Hello, I've filled out the survey and am still in need of write-access in Datadog from my original comment. Please let me know if there's anything else I can do to help make this happen. Thanks so much!

jhouse-solvd commented 1 year ago

Hi @cferris32 - Thanks so much for filling out the survey. We'll be reviewing and synthesizing feedback soon.

Before providing write access to Datadog, the Platform needs to develop roles and permissions for VFS teams (See #47066) and publish documentation on the Platform Website (along with a few other tasks). The information you've provided will be instrumental in making those things happen.

We're meeting with leadership tomorrow to discuss the next steps and hope to provide you with more information - including the release plan and schedule - in the coming weeks.

We appreciate your patience while we work to provide this solution. If this presents any blockers for your team, please escalate to your VA product owner to aid with prioritization.

cc: @mchelen-gov

jhouse-solvd commented 1 year ago

UPDATE 1/31 (also shared during weekly Team of Teams meeting)

Monitoring for VFS teams: Read-only (RO) access to Datadog - beginning February 21

We’re excited to announce the upcoming release of Read-only (RO) access to Datadog for VFS Teams beginning February 21, 2023!

What’s included in the upcoming release?

How will VFS teams get access?

What do VFS teams need to do before then?

cc: @mchelen-gov

nanotone commented 1 year ago

As requested by @jhouse-solvd, I'm adding a note here to describe a use case that the Benefits fast-track team is looking for (including a few steps that are already done).

At time of writing, the vets-api code posts a message to one Slack channel for every success, and another channel for every failure; our current process is just to manually check these channels once a day for abnormalities. This is workable because failed API calls, while not ideal, are not catastrophic for any Veteran. We'd just like the process to be more automated.
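As a rough illustration of what automating that daily check could look like, here is a minimal sketch of a failure-rate threshold check. It is not Datadog's API or the team's actual code: the `CallCounts` shape, the 5% threshold, and the minimum call count are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class CallCounts:
    """Success and failure counts for one time window (hypothetical shape)."""
    successes: int
    failures: int


def failure_rate(counts: CallCounts) -> float:
    """Return the fraction of calls that failed, or 0.0 if there were no calls."""
    total = counts.successes + counts.failures
    return counts.failures / total if total else 0.0


def should_alert(counts: CallCounts, threshold: float = 0.05, min_calls: int = 20) -> bool:
    """Flag a window as abnormal when the failure rate exceeds the threshold.

    Both default values are illustrative assumptions, not recommendations.
    Windows with fewer than min_calls calls are ignored to avoid noisy alerts.
    """
    total = counts.successes + counts.failures
    if total < min_calls:
        return False
    return failure_rate(counts) > threshold
```

In a monitoring tool, the same idea would typically be expressed as a monitor on a failure-rate metric rather than as application code, which is why write access to monitors matters for this use case.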

[Image: Screen Shot 2023-03-15 at 13 47 49]

mchelen-gov commented 1 year ago

@nanotone Thanks for the excellent description! Based on my understanding it sounds like write access to the monitors/alerts would satisfy the 2nd two boxes.

ElijahLynn commented 1 year ago

Monitoring for VFS teams: Read-only (RO) access to Datadog - beginning February 21

We’re excited to announce the upcoming release of Read-only (RO) access to Datadog for VFS Teams beginning February 21, 2023!

Now that VFS teams have access, can we get https://vfs.atlassian.net/wiki/spaces/OT/pages/2233598117/Get+Access+to+Datadog updated? It currently says:

NOTE: Datadog is not currently supported for VFS teams. However, support for VFS teams will be addressed as part of this initiative: [Datadog] Monitoring for VFS Teams #33081