Open codemonkey800 opened 11 months ago
Got some 400 errors today related to a user somehow accessing the plugins page using the template variable [name]: this would mean they accessed /plugins/[name] somehow. Overall this isn't necessarily an error we have to worry about since it's a client error, so we can filter these out by reducing the log level to warning. I've added the task "Assign lower log level to 4xx errors" to capture this 🫡
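A minimal sketch of what lowering the log level for 4xx errors could look like. The helper name and the level names are hypothetical and are not taken from the hub's logging code; the sketch only shows the status-to-level mapping.

```typescript
// Hypothetical log levels; the real logger may use different names.
type StatusLogLevel = 'warn' | 'error';

// Pick a log level from an HTTP status code.
// 4xx responses are client errors, so they are logged as warnings
// and stay out of the error alarm. Everything else remains an error.
function logLevelForStatus(status: number): StatusLogLevel {
  return status >= 400 && status < 500 ? 'warn' : 'error';
}
```

With this mapping, a 400 from /plugins/[name] would be logged as a warning, while a 500 would still be logged as an error.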
Caught Error Alarms
These are errors that are caught and logged in the frontend. The errors for these logs can be surfaced from CloudWatch Logs.
Error fetching spdx license data
This log can be found in CloudWatch when filtering using the following query:
This is related to the error logs that are written when fetching the SPDX license data in the browser throws an error:
https://github.com/chanzuckerberg/napari-hub/blob/41ae7000b6c33ad3a3f5d3bcb21886a9d89f3d1a/frontend/src/components/MetadataList/MetadataListMetadataItem.tsx#L126-L132
Fetching this data in the browser is probably inefficient and more prone to error because of the user’s environment. We can move this fetch to the server side to improve the reliability of this API call. If this doesn’t reduce the number of errors, we can look into reducing the log level of this message.
Error loading route
This log can be found in CloudWatch when filtering using the following query:
This is related to some code for logging when an error occurs while a page is transitioning:
https://github.com/chanzuckerberg/napari-hub/blob/41ae7000b6c33ad3a3f5d3bcb21886a9d89f3d1a/frontend/src/hooks/usePageTransitions.ts#L66-L69
According to the docs, this error occurs if the route transition is cancelled or if an error is thrown, but the code above doesn’t check which case applies when logging the error message. We can refactor the code to use a different log level depending on whether the user cancelled the transition:
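A minimal sketch of this refactor, assuming the error passed to the router's routeChangeError event carries the cancelled flag that the Next.js router documents for cancelled navigations. The function name, the level names, and the message format are hypothetical and are not taken from usePageTransitions.ts.

```typescript
// Only the `cancelled` flag of the route change error is relied on here.
interface RouteChangeError {
  message: string;
  cancelled?: boolean;
}

interface RouteLogEntry {
  level: 'info' | 'error';
  message: string;
}

// Build the log entry for a route change error.
// A cancelled transition is user driven, so it is logged at a lower level;
// any other failure is still logged as an error.
function describeRouteError(err: RouteChangeError, url: string): RouteLogEntry {
  return {
    level: err.cancelled ? 'info' : 'error',
    message: `Error loading route ${url}: ${err.message}`,
  };
}
```

The event handler would then pass the returned entry to the logger, so only transitions that failed for a reason other than cancellation count toward the error metric.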
Ideally this should reduce the number of actual errors we encounter, but if not, we can look into filtering out this error from the logs metric filter if it’s something we can’t easily fix.
Uncaught Error Alarms
These are alarms for errors that are not handled within a try / catch block. Currently RUM has reported the following errors:
CWR: Failed to retrieve credentials from STS: TypeError: Failed to fetch
This error is raised when a network request fails while fetching credentials from AWS STS. The stack trace for this message looks like:
Unfortunately we can't really fix this error since we can't control user network conditions. Instead, we can try filtering this event from being tracked by the alarm.
To do this, we will need to refactor the alarm infrastructure to:
Error details: CWR: Failed to retrieve Cognito OpenId token: TypeError: Failed to fetch
Similar to the above error, this is out of our control due to user network conditions. We can remove this from the frontend alarm by ignoring this specific error message.
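A minimal sketch of ignoring errors by message, assuming matching on the message text is enough to identify them. The constant and function names are hypothetical, and where the check runs (in the client before the event is recorded, or in the alarm's filter) is left open.

```typescript
// Messages caused by user network conditions that we cannot fix.
// The patterns are taken from the error messages reported above.
const IGNORED_ERROR_PATTERNS: RegExp[] = [
  /Failed to retrieve credentials from STS/,
  /Failed to retrieve Cognito OpenId token/,
];

// Returns true when an error message should not count toward the alarm.
function shouldIgnoreError(message: string): boolean {
  return IGNORED_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}
```

Keeping the patterns in one list means other messages that turn out to be safe to filter can be added without touching the matching logic.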
The provided href (/plugins/[name]) value is missing query values
According to the docs, this error occurs when the UI tries to open a URL without supplying a value for the dynamic variable in the pathname.
This error is a bit complex to debug because it happens intermittently and is not easy to reproduce. The frequency appears to be 1-2 instances per week:
The plugin page also does not have links to itself or plugin pages, so it seems technically impossible for this error to occur.
One thing we can try is updating all references to /plugins/[name] to check that name is defined before creating a link or navigating to a route.
If this does not reduce the errors, we could reduce the log level since this type of error doesn't have a huge impact on the functionality of the page. It's possible this error could be a result of an intermittent loading state since some of the errors happen in the loading state for the plugin page.
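A minimal sketch of such a check, assuming links to plugin pages are built from a plugin name that may still be undefined while data is loading. The helper name is hypothetical, and encoding the name is an added assumption rather than something the existing code is known to do.

```typescript
// Build the URL for a plugin page, or return null when the name is not
// available yet (for example during a loading state). Callers skip
// rendering the link or navigating when null is returned.
function getPluginHref(name: string | null | undefined): string | null {
  if (typeof name !== 'string' || name.trim() === '') {
    return null;
  }

  // Assumption: names are encoded so they are safe to place in a path.
  return `/plugins/${encodeURIComponent(name)}`;
}
```

A component would render the link only when the helper returns a string, so the router never receives /plugins/[name] without a value for name.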
Script error
These are unknown errors that happen during JavaScript execution that seemingly only happen on Desktop Safari browsers:
This error may occur when the frontend tries to load JavaScript from another domain. Based on this article, we can possibly fix this by updating references to external JavaScript to include the crossorigin attribute in the <script> tag.
The only reference to this is the script we use for HubSpot:
https://github.com/chanzuckerberg/napari-hub/blob/41ae7000b6c33ad3a3f5d3bcb21886a9d89f3d1a/frontend/src/pages/_app.tsx#L88-L93
If this does not reduce the errors, we can look into filtering out this message for this specific error.
Request aborted
This error occurs when a request is cancelled, which may happen if the user navigates away from a page with an in-progress request, so it should be safe to filter out.
ResizeObserver loop completed with undelivered notifications.
This error occurs when ResizeObserver is trying to notify subscribers of a recent resize. It may occur if the user's page resizes during a notification. Unfortunately we can't control this because of the variety of differences in users' environments, like viewport and browser, so this is something we can look into filtering out.
Action Items
Check that name is defined for all references to /plugins/[name]