edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

[Alerts] Determine how to ignore common errors #625

Open alangsto opened 3 months ago

alangsto commented 3 months ago

It is possible to ignore errors in New Relic, which 2U does for certain types of errors. See https://2u-internal.atlassian.net/wiki/spaces/AT/pages/16385812/Ignored+and+Expected+Errors+for+LMS+in+New+Relic.

Does Datadog have similar functionality for ignoring errors? I found https://docs.datadoghq.com/logs/error_tracking/excluding_logs/ but am not sure if this is a 1:1 solution for what New Relic has.

An example error we'd like to ignore is:

AC:

robrap commented 3 months ago

  1. @alangsto also found https://docs.datadoghq.com/logs/error_tracking/manage_data_collection#add-a-nested-exclusion-filter-to-a-rule.
  2. Is this a ticket we need to look into sooner rather than later so that alert thresholds don't all get messed up, or is DD already not picking up 404, 401, and 403, which were a large part of our ignored errors?

robrap commented 3 months ago

@alangsto: Have you found that this is just a non-issue for now? If so, we can update the epic to "Datadog Migration Future" and review one more time when we are done.

UPDATE: I moved the rest of this comment for discovery of other DD error monitor types to a new ticket: https://github.com/edx/edx-arch-experiments/issues/651.

alangsto commented 3 months ago

@robrap I have not run into this issue yet, but that's only from my work setting up Cosmonauts monitors (which are for the most part fairly straightforward). Other teams may run into this issue, but it's difficult to know without investigating every alert condition in New Relic. I did add a small section in https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1008500757/How+to+migrate+from+New+Relic+to+Datadog#Migrating-a-NRQL-based-alert on how to filter by specific messages for a trace analytics monitor.

I have not investigated the two types of monitors you listed. Is this something we'd like to investigate so we can provide more info to teams?

robrap commented 3 months ago

@alangsto: Thank you. I moved this to the Future epic. We'll see if it comes up once the DRF error reporting in DD is fixed in https://github.com/edx/edx-arch-experiments/issues/647.

Also, as noted above, I moved the other discovery into a new ticket which is also under the Future epic, and we'll see when and if anyone is interested in exploring those features.

dianakhuang commented 3 weeks ago

I believe https://github.com/edx/edx-arch-experiments/issues/738 is a duplicate of this ticket.

robrap commented 3 weeks ago

@dianakhuang: What do you think of closing this ticket in favor of your new ticket, which at least provides a clear example of an error we wish to ignore? Is there anything else from this ticket you'd like to bring in?

dianakhuang commented 3 weeks ago

I think I would rather keep this ticket and move over the example. This ticket has a lot more info than mine does.

robrap commented 1 week ago

We need more details about the specific alerts that are triggering on errors we wish to ignore, so that we have a specific case to fix. For now, I'm marking this as P5, and we may close it (temporarily) until we have that information.