CDCgov / trusted-intermediary

Bringing together healthcare providers by reducing the connection burden.

ReportStream Internal Error Monitoring #1147

Closed JohnNKing closed 2 weeks ago

JohnNKing commented 2 months ago

Story

As an Intermediary engineer, so that I can notify CA about any errors that occur, I need a way to identify NBS errors that occur during intermediary processing within ReportStream.

Pre-conditions

Acceptance Criteria

Tasks

Engineering

AC Queries

Definition of Done

Research Questions

Decisions

Notes

JohnNKing commented 1 month ago

Blocked waiting on access and a meeting to show us the process

JohnNKing commented 1 month ago

Discussing RS proposal today

basiliskus commented 1 month ago

We now have access to RS logs in staging and production. We'll evaluate whether it's possible to create alerts for error logs.

basiliskus commented 1 month ago

RS is actively working on adding log parameters that will allow us to query the logs, filtering by topic, sender, receiver, etc. Here's the PR: https://github.com/CDCgov/prime-reportstream/pull/15263

basiliskus commented 1 month ago

Currently, RS doesn't have the permissions required to create the log alerts and queries. They will look into getting access.

basiliskus commented 1 month ago

I'll mark this story as blocked while RS works on that PR and the ability to create alerts

basiliskus commented 1 month ago

RS PR has been merged. Removing the blocked label

basiliskus commented 1 month ago

Here's a KQL query to filter by the etor-ti topic, now that the capability has been added to RS:

traces
| extend customDimensionsParsed = parse_json(customDimensions)
| where customDimensionsParsed.TOPIC == '"etor-ti"'

basiliskus commented 1 month ago

Here's a first version of a KQL query to select and unpack the fields we care about:

traces
| project timestamp, message, customDimensions, appId
| extend 
    messageParsed = parse_json(message),
    customDimensionsParsed = parse_json(customDimensions)
| extend 
    mdc_span_id = messageParsed.mdc.span_id,
    mdc_trace_flags = messageParsed.mdc.trace_flags,
    mdc_trace_id = messageParsed.mdc.trace_id,
    message_content = messageParsed.message,
    message_thread = messageParsed.thread,
    message_timestamp = messageParsed.timestamp,
    message_level = messageParsed.level,
    message_logger = messageParsed.logger,
    customDimensions_ProcessId = customDimensionsParsed.ProcessId,
    customDimensions_Category = customDimensionsParsed.Category,
    customDimensions_HostInstanceId = customDimensionsParsed.HostInstanceId,
    customDimensions_LogLevel = customDimensionsParsed.LogLevel
| project 
    timestamp,
    appId,
    mdc_span_id,
    mdc_trace_flags,
    mdc_trace_id,
    message_content,
    message_thread,
    message_timestamp,
    message_level,
    message_logger,
    customDimensions_ProcessId,
    customDimensions_Category,
    customDimensions_HostInstanceId,
    customDimensions_LogLevel

basiliskus commented 1 month ago

Another query with extended fields found in customDimensions:

traces
| extend 
    messageParsed = parse_json(message),
    customDimensionsParsed = parse_json(customDimensions)
| extend 
    messageTimestamp = messageParsed.timestamp,
    messageLevel = messageParsed.level,
    messageContent = messageParsed.message,
    messageLogger = messageParsed.logger,
    messageMdc = messageParsed.mdc,
    messageThread = messageParsed.thread,
    customProcessId = customDimensionsParsed.ProcessId,
    customCategory = customDimensionsParsed.Category,
    customHostInstanceId = customDimensionsParsed.HostInstanceId,
    customLogLevel = customDimensionsParsed.LogLevel,
    customPipelineStepName = customDimensionsParsed.pipelineStepName,
    customParentReportId = customDimensionsParsed.parentReportId,
    customChildReportId = customDimensionsParsed.childReportId,
    customBLOB_URL = customDimensionsParsed.BLOB_URL,
    customBlobUrl = customDimensionsParsed.blobUrl,
    customCdProcessId = customDimensionsParsed.ProcessId,
    customSender = customDimensionsParsed.sender,
    customSubmittedReportIds = customDimensionsParsed.submittedReportIds,
    customCdTimestamp = customDimensionsParsed.timestamp,
    customTopic = customDimensionsParsed.topic,
    customTrackingId = customDimensionsParsed.trackingId
| project
    timestamp, 
    message, 
    customDimensions,
    messageTimestamp,
    messageLevel,
    messageContent,
    messageLogger,
    messageMdc,
    messageThread,
    customProcessId,
    customCategory,
    customHostInstanceId,
    customLogLevel,
    customPipelineStepName,
    customParentReportId,
    customChildReportId,
    customBLOB_URL,
    customBlobUrl,
    customCdProcessId,
    customSender,
    customSubmittedReportIds,
    customCdTimestamp,
    customTopic,
    customTrackingId
| where customTopic == "etor-ti"

basiliskus commented 1 month ago

Documented AppInsights KQL queries in RS: https://github.com/CDCgov/prime-reportstream/blob/master/prime-router/docs/observability/azure-events.md

basiliskus commented 1 month ago

After the most recent update in RS, this is how we can query logs by sender name:

customEvents
| where name == "REPORT_RECEIVED"
| extend params = parse_json(tostring(customDimensions.params))
| where params.senderName == "flexion.etor-service-sender"

By receiver name:

customEvents
| where name == "REPORT_SENT"
| extend params = parse_json(tostring(customDimensions.params))
| where params.receiverName == "la-phl.etor-nbs-orders"

basiliskus commented 4 weeks ago

It seems this is where the Terraform resource should be added in RS: operations/app/terraform/modules/application_insights/

basiliskus commented 3 weeks ago

I'm documenting error scenarios in RS and the queries associated with troubleshooting them here

JohnNKing commented 3 weeks ago

Just one thing I noticed when comparing REPORT_RECEIVED to REPORT_SENT on the etor-ti topic: there's a slight disparity that "not routed" outcomes and trace errors don't seem to account for.

Past month:

Looking at just Aug 13-15 (midnight-midnight UTC)
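
A comparison like this could be sketched in KQL roughly as follows. This is only an illustration: it reuses the `REPORT_RECEIVED` / `REPORT_SENT` event names and the sender and receiver names from the queries earlier in this thread, and it assumes that filtering on those two names is an adequate stand-in for "the etor-ti topic". The 30-day window, the one-day bins, and the `received`, `sent`, and `disparity` column names are arbitrary choices for the example.

```kusto
// Sketch: daily counts of received vs. sent reports, to spot a disparity.
// Assumption: the sender/receiver names below (taken from earlier comments)
// identify the traffic of interest; adjust them for other partners.
customEvents
| where timestamp > ago(30d)
| where name in ("REPORT_RECEIVED", "REPORT_SENT")
| extend params = parse_json(tostring(customDimensions.params))
| where (name == "REPORT_RECEIVED" and tostring(params.senderName) == "flexion.etor-service-sender")
    or (name == "REPORT_SENT" and tostring(params.receiverName) == "la-phl.etor-nbs-orders")
| summarize
    received = countif(name == "REPORT_RECEIVED"),
    sent = countif(name == "REPORT_SENT")
    by bin(timestamp, 1d)
| extend disparity = received - sent
| order by timestamp asc
```

Days with a non-zero `disparity` would be the ones worth checking against the trace errors. A report received late in one day and sent early in the next can also show up as a disparity in adjacent bins.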

basiliskus commented 2 weeks ago

I found the error for the message that arrived at RS at 7:39 PM UTC on 8/13: Unexpected scheme: null. To get the error, you can use this query: `traces | where severityLevel > 1