Open dsotirho-ucsc opened 2 years ago
@hannes-ucsc: "Consider retrying on 502s from TDR."
This same failure was encountered in the post_deploy step in a GitLab build: https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/56588#L4193
This happened again in anvilbox
https://gitlab.anvil.gi.ucsc.edu/ucsc/azul/-/jobs/6505#L1248
This exception by Terra triggered the azul-service_5xx-anvildev alarm.
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/apigateway/azul-service-anvildev, /aws/lambda/azul-service-anvilbox
start-time: -10800s
end-time: 0s
query-string:
fields @timestamp, @message, @logStream, @log
| filter @message like /Unexpected response from https:\/\/data.terra.bio/
| sort @timestamp desc
| limit 1
[
{
"@timestamp": "2023-02-03 02:42:04.897",
"@message": "
Traceback (most recent call last):
File "/var/task/azul/service/source_service.py", line 102, in _get
result = response['Item']
KeyError: 'Item'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/azul/service/source_service.py", line 71, in list_sources
sources = self._get(cache_key)
File "/var/task/azul/service/source_service.py", line 104, in _get
raise NotFound(key)
azul.service.source_service.NotFound: Key not found: 'anvil:'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/chalice/app.py", line 1913, in _get_view_function_response
response = view_function(**function_args)
File "/var/task/app.py", line 1215, in repository_search
return app.repository_controller.search(catalog=app.catalog,
File "/var/task/azul/service/repository_controller.py", line 78, in search
filters = self.get_filters(catalog, authentication, filters)
File "/var/task/azul/service/source_controller.py", line 69, in get_filters
source_ids=self._list_source_ids(catalog, authentication))
File "/var/task/azul/service/source_controller.py", line 60, in _list_source_ids
sources = self.list_sources(catalog, authentication)
File "/var/task/azul/service/source_controller.py", line 45, in list_sources
sources = self._source_service.list_sources(catalog, authentication)
File "/var/task/azul/service/source_service.py", line 73, in list_sources
sources = list(plugin.list_sources(authentication))
File "/var/task/azul/plugins/repository/tdr.py", line 106, in list_sources
snapshots = tdr.snapshot_names_by_id(filter=filter)
File "/var/task/azul/terra.py", line 608, in snapshot_names_by_id
response = self._check_response(endpoint, response)
File "/var/task/azul/terra.py", line 570, in _check_response
raise TerraStatusException(endpoint, response)
azul.terra.TerraStatusException: ('Unexpected response from https://data.terra.bio/api/repository/v1/snapshots?offset=0&limit=1000&sort=created_date&direction=asc&filter=ANVIL_', 502, b'\n<html><head>\n<meta http-equiv="content-type" content="text/html;charset=utf-8">\n<title>502 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n')
",
"@logStream": "2023/02/03/[$LATEST]c82197830b2d4c3cb7368d507201d82a",
"@log": "289950828509:/aws/lambda/azul-service-anvildev"
}
]
This exception by Terra also triggered the azul-service_5xx-prod alarm.
azul.terra.TerraStatusException: ('Unexpected response from https://data.terra.bio/api/repository/v1/snapshots?offset=0&limit=1000&sort=created_date&direction=asc&filter=hca_prod_', 502, b'\n<html><head>\n<meta http-equiv="content-type" content="text/html;charset=utf-8">\n<title>502 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n')
Seems to have been an actual outage.
The solution is to enable retries in Terra clients, but only during IT. This intersects with the time-boxing that we apply. The same condition that extends the timeout for Terra requests should be used to enable retries. This will also solve #5003 so the PR for this should be connected to both issues.
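A minimal sketch of the retry behavior described above, not Azul's actual implementation. The function name, the `retry_enabled` flag (standing in for whatever condition already extends the Terra timeout), the `(status, body)` response shape and the attempt count are all hypothetical. The 30 second default delay is taken from the "Please try again in 30 seconds" hint in the 502 body and would need to fit inside the time box that applies to the request.

```python
import time
from typing import Callable, Tuple

# Hypothetical response shape for this sketch: (HTTP status code, body)
Response = Tuple[int, bytes]


def request_with_retries(send: Callable[[], Response],
                         *,
                         retry_enabled: bool,
                         max_attempts: int = 3,
                         delay: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep
                         ) -> Response:
    """
    Call `send` and, if `retry_enabled` is true, repeat the call while it
    returns a 502, up to `max_attempts` attempts in total. With retries
    disabled, `send` is called exactly once.
    """
    assert max_attempts >= 1, max_attempts
    attempts = max_attempts if retry_enabled else 1
    for attempt in range(1, attempts + 1):
        status, body = send()
        # Return anything that is not a 502, and the last response
        # regardless of status, so the caller can raise as it does today.
        if status != 502 or attempt == attempts:
            return status, body
        sleep(delay)
    raise AssertionError('unreachable')
```

The caller would pass the same condition it uses to extend the timeout as `retry_enabled`, so that regular service requests still fail fast on the first 502.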
This happened again, but during the deploy step in GitLab prod.
https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/35596
^^ Retrying that job failed again with the same error: https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/35604#L933
https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/29717