Improve Error Handling and Debugging Posture

ccostino commented 4 months ago

Several parts of the application would greatly benefit from better logging configuration and from additional log statements that give clearer insight into how the application is behaving, whether things are going well or not.

Additionally, there are a few things we could do to help ourselves with debugging and application development by adding some additional tools (e.g., using a more robust debugger rather than just pdb), utility methods, and project/test configuration.

NOTE: Story issues have not been created yet; this is very much still a WIP!

Improve error handling

There are several wrappers in the notifications_utils code that swallow errors or raise exceptions in ways that are counter-intuitive and counter-productive; we need to adjust these to make them more useful.
- See https://github.com/GSA/notifications-admin/issues/1392 for a recent example
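
As a rough illustration of the pattern described above (not the actual utils code; `fetch_template`, `swallowing_wrapper`, `transparent_wrapper`, and `TemplateLookupError` are hypothetical names), compare a wrapper that hides the failure with one that logs it and re-raises with the original cause attached:

```python
import logging

logger = logging.getLogger(__name__)


class TemplateLookupError(Exception):
    """Hypothetical domain error used only for this illustration."""


def fetch_template(template_id):
    # Hypothetical helper that always fails, standing in for a real lookup.
    raise KeyError(template_id)


def swallowing_wrapper(template_id):
    # Anti-pattern: the caller gets None and no hint of what went wrong.
    try:
        return fetch_template(template_id)
    except Exception:
        return None


def transparent_wrapper(template_id):
    # Log with the stack trace, then re-raise with the original cause chained.
    try:
        return fetch_template(template_id)
    except KeyError as e:
        logger.exception(f"Template lookup failed for {template_id}")
        raise TemplateLookupError(f"No template {template_id}") from e
```

With the second form, the caller sees a specific exception and the log keeps the full traceback of the underlying `KeyError`.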

Improve debugging

Look into possibly refactoring the hilite method as another logging formatter as well; also see if it can be combined with or leverage anything from Werkzeug if that'd make it even more useful or robust
- Alternatively, check to see if it could be refactored into
Investigate other Python debuggers and see if there is something else we could be leveraging for an improved debugging experience
- We could also look into leveraging Visual Studio Code more (e.g., the Python Debugger extension)
- There's a pytest piece to this too, in terms of how to debug tests
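
A minimal sketch of the formatter idea, assuming hilite does nothing more than wrap a message in ANSI color codes (an assumption; the real helper may differ). `HiliteFormatter` and the `hilite_demo` logger name are hypothetical:

```python
import logging

ANSI_GREEN = "\033[32m"
ANSI_RESET = "\033[0m"


class HiliteFormatter(logging.Formatter):
    """Hypothetical formatter that colors the whole formatted log line."""

    def format(self, record):
        # Let the base class build the line (including any traceback) first.
        line = super().format(record)
        return f"{ANSI_GREEN}{line}{ANSI_RESET}"


handler = logging.StreamHandler()
handler.setFormatter(HiliteFormatter("%(levelname)s %(name)s: %(message)s"))

demo_logger = logging.getLogger("hilite_demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.DEBUG)
demo_logger.debug("Highlighted without calling a helper at each call site")
```

Putting the highlighting in a formatter would keep it out of individual log calls, so it could be enabled for local development only.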

xlorepdarkhelm commented 2 months ago

Here are some thoughts for improving logging & error handling

Here's a combined series of sprints that integrates both the error handling improvements and the logging enhancements, structured into manageable tasks over multiple two-week Agile sprints.

Sprint 1: Audit, Documentation, and Standardization

Task 1: Comprehensive Audit

Objective: Conduct a full audit of both logging and error handling practices across the project.
Steps:
- Identify all try-except blocks, logger.error, logger.exception, print statements, and other logging or error-handling mechanisms.
- Document the current practices, noting inconsistencies, missing exc_info=True in error logs, usage of generic exceptions, and places where logging is either missing or suboptimal.
- Identify all print statements and prepare to replace them with appropriate logging statements.
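
Part of the audit steps above could be automated. The following is only a sketch using the standard `ast` module; `audit_source` is a hypothetical helper, and it uses a crude heuristic (any `.error(...)` or `.exception(...)` attribute call is assumed to be a logger call):

```python
import ast


def audit_source(source):
    """Count print calls and logger.error calls that lack exc_info."""
    counts = {"print": 0, "error_without_exc_info": 0, "exception": 0}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id == "print":
            counts["print"] += 1
        elif isinstance(func, ast.Attribute) and func.attr == "error":
            # Heuristic: any `.error(...)` call is treated as a logger call.
            if not any(kw.arg == "exc_info" for kw in node.keywords):
                counts["error_without_exc_info"] += 1
        elif isinstance(func, ast.Attribute) and func.attr == "exception":
            counts["exception"] += 1
    return counts


sample = '''
print("hi")
logger.error("bad")
logger.error("bad", exc_info=True)
logger.exception("worse")
'''
print(audit_source(sample))
```

Running the same function over each `.py` file in the repository would give per-file counts to include in the audit report.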

Task 2: Documentation of Best Practices

Objective: Establish guidelines for consistent and robust logging and error handling.
Steps:
- Create a document outlining best practices for logging and error handling, including the use of f-strings, specific exception handling, and when to use different logging levels.
- Include examples and scenarios specific to the project.
- Review this document with the team and gather feedback.

Sprint 1 Deliverables:

A comprehensive audit report detailing the current state of logging and error handling.
A best practices guide for logging and error handling, tailored to the project.

Sprint 2: Replace `print` Statements and Improve Error Logging

Task 1: Replace `print` Statements with Logging

Objective: Replace all print statements with appropriate logging calls.
Steps:
- Search for all print statements in the codebase.
- Replace print statements with logger.info, logger.debug, or other appropriate logging levels using f-strings.
Example Before:
```
print(f"Processing data: {data}")
```
Example After:
```
logger.info(f"Processing data: {data}")
```

Task 2: Enhance Error Logging with Stack Traces

Objective: Ensure all logger.error statements include stack traces where applicable.
Steps:
- Update logger.error statements to include exc_info=True for stack traces.
- Refactor the log messages to use f-strings for consistency and readability.
Example Before:
```
logger.error("An error occurred: %s", str(e))
```
Example After:
```
logger.error(f"An error occurred: {e}", exc_info=True)
```
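
To show what `exc_info=True` changes, here is a small self-contained sketch that captures records in memory and checks whether a traceback is attached. `ListHandler` and the `exc_info_demo` logger name are made up for the demonstration:

```python
import logging


class ListHandler(logging.Handler):
    """Collects records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("exc_info_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
capture = ListHandler()
logger.addHandler(capture)

try:
    int("not a number")
except ValueError as e:
    logger.error(f"Without trace: {e}")              # no traceback attached
    logger.error(f"With trace: {e}", exc_info=True)  # traceback attached
    logger.exception(f"Also with trace: {e}")        # ERROR level plus traceback

for record in capture.records:
    print(record.getMessage(), "| has traceback:", record.exc_info is not None)
```

Inside an `except` block, `logger.exception(...)` is a shorthand for `logger.error(..., exc_info=True)`.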

Sprint 2 Deliverables:

All print statements replaced with appropriate logging statements.
Enhanced logger.error statements with stack traces and f-strings.

Sprint 3: Centralized Error Handling and Logging Standardization

Task 1: Implement Centralized Error Handling

Objective: Create a centralized error handling mechanism for the project.
Steps:
- Implement a global error handler (e.g., for Flask, Celery) that logs all uncaught exceptions using logger.exception.
- Ensure that the global handler uses f-strings for logging messages.
Example for Flask:
```
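from werkzeug.exceptions import HTTPException

@app.errorhandler(Exception)
def handle_exception(e):
  if isinstance(e, HTTPException):
    return e  # let normal HTTP errors (404, 405, ...) pass through
  logger.exception(f"Unhandled exception: {e}")
  return {"error": "An unexpected error occurred"}, 500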
```
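
For code that runs outside a Flask request (for example scripts or CLI commands), one possible complement is the interpreter-level `sys.excepthook`. This is a sketch only; `log_uncaught_exception`, `install_excepthook`, and the `uncaught` logger name are hypothetical, and Celery workers handle task failures through their own mechanisms:

```python
import logging
import sys

logger = logging.getLogger("uncaught")


def log_uncaught_exception(exc_type, exc_value, exc_traceback):
    """Log any exception that reaches the top of the interpreter."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Keep Ctrl+C behaving normally.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logger.critical(
        f"Uncaught exception: {exc_value}",
        exc_info=(exc_type, exc_value, exc_traceback),
    )


def install_excepthook():
    # Call once at process start-up to route uncaught errors to the logger.
    sys.excepthook = log_uncaught_exception
```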

Task 2: Standardize Logging Levels

Objective: Ensure consistent usage of logging levels across the project.
Steps:
- Review all logging statements and ensure that the correct logging level is used (DEBUG, INFO, WARNING, ERROR, CRITICAL).
- Refactor any info statements that should be debug and vice versa, based on their context.
Example Before:
```
logger.info("Starting process...")
```
Example After:
```
logger.debug("Starting process...")
```
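
The practical effect of choosing debug over info shows up when the logger threshold changes. A small sketch (the `levels_demo` logger name and the in-memory stream are just for demonstration):

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))

logger = logging.getLogger("levels_demo")
logger.addHandler(handler)
logger.propagate = False

# Production-like threshold: DEBUG records are dropped.
logger.setLevel(logging.INFO)
logger.debug("Starting process...")
logger.info("Process finished")

# Development threshold: everything is emitted.
logger.setLevel(logging.DEBUG)
logger.debug("Starting process...")

print(stream.getvalue())
```

Only the second `debug` call reaches the output, so messages demoted to debug stop adding noise at an INFO threshold while staying available during development.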

Sprint 3 Deliverables:

Centralized error handler implemented across the project.
Logging levels standardized throughout the codebase.

Sprint 4: Custom Exceptions and Contextual Logging

Task 1: Introduce Custom Exceptions

Objective: Replace generic exceptions with custom exceptions for more specific error handling.
Steps:
- Define custom exceptions for common error scenarios (e.g., InvalidUserInputError, DatabaseConnectionError).
- Replace generic Exception handling with specific custom exceptions.
Example Before:
```
try:
  ...  # some code
except Exception as e:
  logger.error(f"An error occurred: {e}")
```
Example After:
```
try:
  ...  # some code
except InvalidUserInputError as e:
  logger.error(f"Invalid user input: {e}", exc_info=True)
```
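
One way the custom exceptions themselves might be defined is as a small hierarchy with a shared base class, so callers can catch either a specific error or the whole family. `NotifyError` and `parse_age` are hypothetical names used only for this sketch:

```python
class NotifyError(Exception):
    """Hypothetical base class for project-specific errors."""


class InvalidUserInputError(NotifyError):
    """Raised when caller-supplied data fails validation."""


class DatabaseConnectionError(NotifyError):
    """Raised when the database cannot be reached."""


def parse_age(raw):
    try:
        return int(raw)
    except ValueError as e:
        # Translate the low-level error, keeping the original as __cause__.
        raise InvalidUserInputError(f"Age must be a number, got {raw!r}") from e
```

Using `raise ... from e` preserves the original traceback, so the more specific exception does not hide where the failure started.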

Task 2: Implement Contextual Logging

Objective: Enhance log messages with context-specific information, especially in asynchronous tasks and critical code paths.
Steps:
- Add context information (e.g., request ID, user ID) to log messages where appropriate.
- Refactor existing log messages to include relevant context using f-strings.
Example Before:
```
logger.info("Sending email")
```
Example After:
```
logger.info(f"Request {request_id}: Sending email to {user_email}")
```
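
Rather than interpolating the request ID by hand in every message, the context could be attached once and added by the logging machinery. A sketch using a `contextvars` variable and a `logging.Filter` (the names `request_id_var`, `RequestIdFilter`, and `context_demo` are assumptions, not existing project code):

```python
import contextvars
import io
import logging

request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request ID onto every record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("[%(request_id)s] %(message)s"))

logger = logging.getLogger("context_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

request_id_var.set("req-123")  # e.g. set once at the start of a request
logger.info("Sending email")
print(stream.getvalue())
```

The call site stays as short as the "before" example while the formatted line still carries the request ID.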

Sprint 4 Deliverables:

Custom exceptions defined and integrated into the project.
Contextual logging implemented throughout the project.

Sprint 5: Improve Background Task Error Handling and Performance Logging

Task 1: Enhance Error Handling in Background Tasks

Objective: Improve error handling and retry mechanisms in Celery tasks and other background jobs.
Steps:
- Review Celery tasks and ensure they use logger.exception for logging errors with stack traces.
- Implement or refine retry mechanisms for transient errors.
Example:
```
@app.task(bind=True)
def my_task(self):
  try:
      ...  # task logic
  except SomeTransientError as e:
      raise self.retry(exc=e, countdown=60, max_retries=3)
  except Exception as e:
      logger.exception(f"Task failed: {e}")
      raise  # re-raise so Celery records the task as failed
```
Task 2: Implement Performance Logging

Objective: Introduce logging for key performance metrics (e.g., request processing time, task duration).
Steps:
- Add performance logging to critical sections of the code to track execution time and identify bottlenecks.
- Use f-strings to dynamically include performance metrics in log messages.
Example:
```
import time

start_time = time.perf_counter()  # monotonic clock, suited to measuring durations
# perform task
logger.info(f"Task completed in {time.perf_counter() - start_time:.2f} seconds")
```
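
To avoid repeating the timing boilerplate, the measurement could be wrapped in a context manager. This is a sketch; `log_duration` and the `perf_demo` logger name are hypothetical:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("perf_demo")


@contextmanager
def log_duration(label):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info(f"{label} took {elapsed:.2f} seconds")


with log_duration("demo task"):
    sum(range(1000))
```

Because the log call sits in a `finally` clause, the duration is recorded for failed runs too, which helps when a slow path is also the one that errors.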

Sprint 5 Deliverables:

Improved error handling and retry mechanisms in background tasks.
Performance logging implemented in key areas of the project.

Sprint 6: Final Validation and Cleanup

Task 1: Validate and Improve Input/Output Handling

Objective: Ensure robust input and output validation across the project.
Steps:
- Review functions for input validation and add checks where necessary.
- Implement or improve output validation to ensure data integrity.
Example:
```
def process_data(data):
  if not isinstance(data, dict):
      raise InvalidUserInputError(f"Expected a dictionary, got {type(data)}")
```

Task 2: Clean Up Unused Code and Final Refactoring

Objective: Remove dead code and perform final refactoring to ensure the project is clean and maintainable.
Steps:
- Remove any identified dead code, such as unused functions, imports, and unreachable code.
- Perform final refactoring to ensure consistency in error handling and logging practices.

Sprint 6 Deliverables:

Validated input and output handling throughout the project.
Cleaned-up codebase with unused code removed and final refactorings applied.

Overall Outcome:

By the end of these sprints, the project will have robust and consistent logging and error handling practices, leveraging modern Python features like f-strings, custom exceptions, and centralized error management. This will result in a more maintainable, reliable, and performant codebase.

(Some parts may need tweaking for our project's specific needs, as we already have integration into logging systems, etc.)

xlorepdarkhelm commented 2 months ago

Here is the breakdown of the number of logger statements for each level of logging in the project:

INFO: 228 statements
DEBUG: 38 statements
WARNING: 64 statements
ERROR: 146 statements; 89 error loggers that do not include stack traces + 57 exception loggers that do include stack traces.
CRITICAL: 0 statements

There are also 131 print statements in the code that would need to be converted into logger statements at the appropriate levels.

xlorepdarkhelm commented 2 months ago

Further, there are 11 distinct info statements that might be better logged at the debug level:

File: app/commands.py
- Line 285: current_app.logger.info(f"DATA = {data}")
File: app/aws/s3.py
- Line 111: current_app.logger.info(f"File downloaded successfully to {local_filename}")
File: app/celery/research_mode_tasks.py
- Line 54: current_app.logger.info("Mocked provider callback request finished")
File: app/celery/scheduled_tasks.py
- Line 133: current_app.logger.info("Job(s) {} have not completed.".format(job_ids))
File: app/celery/test_key_tasks.py
- Line 54: current_app.logger.info("Mocked provider callback request finished")
File: app/clients/cloudwatch/aws_cloudwatch.py
- Line 58: current_app.logger.info(f"START TIME {beginning} END TIME {now}")
File: app/service/rest.py
- Line 205: current_app.logger.info(f'SERVICE: {data["id"]}; {data}')
File: app/user/rest.py
- Line 429: current_app.logger.info("Sending email verification for user {}".format(user_id))
- Line 472: current_app.logger.info("Sending notification to queue")
- Line 514: current_app.logger.info("Sending notification to queue")
File: notifications_utils/clients/zendesk/zendesk_client.py
- Line 37: current_app.logger.info(f"Zendesk create ticket {ticket_id} succeeded")

These statements were identified because they include keywords or patterns that typically align with debug level logging.

There are also 4 lines that log at info but might need to be raised to a higher level, such as warning or error:

File: app/errors.py
- Line 52: current_app.logger.info(error)
- Line 57: current_app.logger.info(error)
- Line 62: current_app.logger.info(error)
- Line 69: current_app.logger.info(error)

The log statements occur in the app/errors.py file and should likely be elevated to a higher logging level, such as error.
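
If the handlers in app/errors.py cover both client and server errors, one option is to choose the level from the response status code instead of using a single level everywhere. This is purely a sketch; `level_for_status` is a hypothetical helper and the thresholds are an assumption, not a decision made in this issue:

```python
import logging


def level_for_status(status_code):
    """Hypothetical mapping from HTTP status code to logging level."""
    if status_code >= 500:
        return logging.ERROR
    if status_code >= 400:
        return logging.WARNING
    return logging.INFO


logger = logging.getLogger("errors_demo")
logger.log(level_for_status(503), "Upstream provider unavailable")
```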

xlorepdarkhelm commented 2 months ago

Something to note:

Recent versions of Python (3.11 and above) produce far more precise tracebacks and error messages, for example by marking the exact expression within a line that raised the error. Logging these stack traces will greatly improve our ability to pinpoint issues.

GSA / notifications-api

Improve Error Handling and Debugging Posture #1065
