This PR introduces the RegionStatusMonitor, which handles and publishes region request statuses to an SNS topic. It follows the pattern established by the ImageStatusMonitor, keeping status handling consistent within the Model Runner. The change fixes the current issue where the Model Runner does not write status updates to the region status topic as expected. Existing messaging is migrated to a new consolidated class, StatusMessage, which renames the output property from image_status to status. The old image_status property remains available until model-runner v3.0, at which point it will be deprecated.
This update also allows ModelRunner to re-drive partially completed image/region requests. If any tiles in a region fail to process and that region is re-driven from the DLQ, only the tiles that failed are run again. Likewise, if a user manually submits an image request with the same job ID as a job where some tiles failed, only the failed tiles are reprocessed.
Key Changes:
SNS Publishing:
The RegionStatusMonitor encapsulates logic for publishing messages to an SNS topic, ensuring that each region's status is properly communicated as processing occurs.
Each status message includes relevant attributes such as region_id, status, processing_duration, job_id, and failed_tiles, so the region's processing state is communicated clearly. To this end we have migrated all the appropriate table and message classes to use the same processing_duration field, whereas before some instances were using processing_time.
Created new consolidated StatusMessage and BaseStatusMonitor classes that support messaging to both the ImageStatus and RegionStatus topics.
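For reviewers, here is a minimal sketch of how a consolidated status message could be turned into SNS message attributes while still exposing the legacy image_status key. It is an illustration under stated assumptions, not the code in this PR: the field list comes from the description above, while the dataclass layout, the asdict_str_values and to_sns_attributes names, and the "PARTIAL" status value are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StatusMessage:
    """Hypothetical consolidated message; field names follow the PR text."""

    status: str
    job_id: str
    region_id: Optional[str] = None
    processing_duration: Optional[float] = None
    failed_tiles: Optional[List[Any]] = None

    def asdict_str_values(self) -> Dict[str, str]:
        """Return the populated fields as strings, skipping unset ones."""
        values: Dict[str, str] = {}
        for key, value in asdict(self).items():
            if value is None:
                continue
            values[key] = value if isinstance(value, str) else json.dumps(value)
        # Assumption: the legacy key mirrors the new one until v3.0.
        values["image_status"] = values["status"]
        return values


def to_sns_attributes(message: StatusMessage) -> Dict[str, Dict[str, str]]:
    """Build a dict in the MessageAttributes shape accepted by SNS publish."""
    return {
        key: {"DataType": "String", "StringValue": value}
        for key, value in message.asdict_str_values().items()
    }


if __name__ == "__main__":
    msg = StatusMessage(
        status="PARTIAL",  # hypothetical status value
        job_id="job-1",
        region_id="region-1",
        processing_duration=12.5,
        failed_tiles=[[[0, 0], [512, 512]]],
    )
    print(to_sns_attributes(msg))
```

The resulting dictionary could be passed as the MessageAttributes argument of a boto3 SNS publish call. Subscribers that filter on image_status would keep working while new consumers move to status.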
Databases:
Added a method add_succeeded_tile in RegionRequestTable: This method appends values to an existing list property of a DynamoDB item for tiles that have succeeded for a region. The tile worker runs it immediately after each tile has finished writing features, to update the appropriate RegionRequestItem.
The tile_worker.py and tile_worker_utils.py logic now writes both succeeded_tiles and failed_tiles as properties to the region request database items as part of their workflows, exposing tile status upstream to the Model Runner.
Added an expire_time property to the region request table items so that the TTL setting works as intended for this table. The RegionRequestTable now sets it correctly.
Added automatic conversion of Decimal values to native int/float types in methods interacting with DynamoDB (get_ddb_item(), update_ddb_item(), and query_items()).
Introduced a convert_decimal utility function that recursively processes retrieved items to ensure that numeric fields are represented with appropriate native Python types, preventing unexpected behavior caused by DynamoDB's use of Decimal.
Added a from_region_request helper method to the RegionRequestItem class for creating a RegionRequestItem instance from a RegionRequest object, mapping relevant fields including region_bounds, tile_size, tile_overlap, tile_format, and tile_compression.
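To illustrate the Decimal handling described above, the following is a minimal sketch of a recursive converter. It assumes items arrive as plain dicts and lists, as the boto3 DynamoDB resource returns them. The function name matches the PR text, but the body is an illustration and may differ from the actual implementation.

```python
from decimal import Decimal
from typing import Any


def convert_decimal(obj: Any) -> Any:
    """Recursively replace Decimal values with native int or float."""
    if isinstance(obj, Decimal):
        # Whole numbers become int; anything with a fractional part becomes float.
        return int(obj) if obj == obj.to_integral_value() else float(obj)
    if isinstance(obj, dict):
        return {key: convert_decimal(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [convert_decimal(value) for value in obj]
    # Strings, booleans, None, and other types pass through unchanged.
    return obj


if __name__ == "__main__":
    item = {
        "job_id": "job-1",
        "tile_size": [Decimal("512"), Decimal("512")],
        "processing_duration": Decimal("1.5"),
    }
    print(convert_decimal(item))
```

Applying the conversion at the table-access layer means callers can compare and serialize numeric fields without special-casing Decimal.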
Redriving Partial Requests:
Implemented tile filtering logic: Added filtering for ImageRegions based on the succeeded_tiles attribute in RegionRequestItem, ensuring only unprocessed tiles are passed to the tile queue.
Handled tile structure conversion: Adjusted the filtering to match the structure [[row, col], [width, height]], allowing accurate comparison between ImageRegions and the tiles marked as succeeded.
Added logging for skipped tiles: Added a log warning when tiles that have already been processed are filtered out, ensuring better visibility into redundant work avoidance.
Enabled re-driving of failed regions without reprocessing succeeded tiles: Failed regions can now be re-driven and only their unprocessed tiles are run, which avoids redundant computation.
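The filtering steps above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: it assumes tiles are generated as nested tuples ((row, col), (width, height)) and that succeeded_tiles comes back from the table as nested lists in the same [[row, col], [width, height]] layout. The function name filter_succeeded_tiles is hypothetical.

```python
import logging
from typing import List, Sequence, Tuple

# Assumed tile layout: ((row, col), (width, height)).
ImageRegion = Tuple[Tuple[int, int], Tuple[int, int]]

logger = logging.getLogger(__name__)


def filter_succeeded_tiles(
    tiles: List[ImageRegion],
    succeeded_tiles: Sequence[Sequence[Sequence[int]]],
) -> List[ImageRegion]:
    """Return only the tiles that are not recorded as succeeded."""
    # Normalize stored lists to tuples so they are hashable and comparable.
    done = {(tuple(tile[0]), tuple(tile[1])) for tile in succeeded_tiles}
    remaining = [
        tile for tile in tiles if (tuple(tile[0]), tuple(tile[1])) not in done
    ]
    skipped = len(tiles) - len(remaining)
    if skipped:
        logger.warning("Skipping %d tiles that already succeeded", skipped)
    return remaining


if __name__ == "__main__":
    all_tiles = [((0, 0), (512, 512)), ((0, 512), (512, 512))]
    already_done = [[[0, 0], [512, 512]]]
    print(filter_succeeded_tiles(all_tiles, already_done))
```

Normalizing both sides to tuples keeps the comparison independent of whether a tile came from the tile generator or from a stored item.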
Unit Tests:
Updated unit tests to reflect new logic across updated code.
Included region request topic usage in our end-to-end testing in `test_app.py`.
Bug Fix:
Fixed an issue with the run_container.sh script so that it runs as intended.
Now that error reporting is done through the `tile_worker.py` logic, removed `error_count` tracking from the corresponding `*Detector` classes.
Added gdal.UseExceptions() to feature_utils.py to avoid GDAL throwing warnings when using this file.
Updated PyDoc comments on existing tables to cover all current members.
Testing:
Verified end-to-end flow for region request processing to ensure the correct statuses are published.
Verified that submitting a region with previously succeeded tiles only loads the tiles that had not yet succeeded. This was tested by resubmitting an identical job_id for a job whose tiles had all already succeeded.
This PR ensures that region request status updates are handled and communicated consistently, resolving the existing issue and aligning with the existing pattern for image status monitoring.
Checklist
Before you submit a pull request, please make sure you have the following:
[x] Code changes are compact and well-structured to facilitate easy review
[x] Changes are documented in the README.md and other relevant documentation pages
[x] PR title and description accurately reflect the changes and are detailed enough for historical tracking
[x] PR contains tests that cover all new code and the code has been manually tested
[x] All new dependencies are declared (if any), and no unnecessary libraries are added
[x] Performance impacts (if any) of the changes are evaluated and documented
[x] Security implications of the changes (if any) are reviewed and addressed
Issue #, if available: n/a
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.