DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

IT failure: unexpected proxy error response from Terra #4684

Open dsotirho-ucsc opened 2 years ago

dsotirho-ucsc commented 2 years ago

https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/29717

ERROR: test_can_bundle_configured_catalogs (integration_test.CanBundleScriptIntegrationTest) (catalog='dcp21-it', repository=Config.Catalog.Plugin(name='tdr_hca'))
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/builds/ucsc/azul/test/integration_test.py", line 1459, in test_can_bundle_configured_catalogs
    self._test_catalog(catalog)
  File "/builds/ucsc/azul/test/integration_test.py", line 1420, in _test_catalog
    self._can_bundle(source=str(fqid.source.spec),
  File "/builds/ucsc/azul/test/integration_test.py", line 1498, in _can_bundle
    return self._can_bundle_main(args)
  File "/builds/ucsc/azul/scripts/can_bundle.py", line 63, in main
    bundle = fetch_bundle(args.source, args.uuid, args.version)
  File "/builds/ucsc/azul/scripts/can_bundle.py", line 77, in fetch_bundle
    configured_source = plugin.resolve_source(configured_source)
  File "/builds/ucsc/azul/src/azul/plugins/__init__.py", line 471, in resolve_source
    id = self.lookup_source_id(spec)
  File "/builds/ucsc/azul/src/azul/plugins/repository/tdr.py", line 154, in lookup_source_id
    return self.tdr.lookup_source(spec).id
  File "/builds/ucsc/azul/src/azul/terra.py", line 444, in lookup_source
    source = self._lookup_source(source_spec)
  File "/builds/ucsc/azul/src/azul/terra.py", line 476, in _lookup_source
    response = self._check_response(endpoint, response)
  File "/builds/ucsc/azul/src/azul/terra.py", line 568, in _check_response
    raise TerraStatusException(endpoint, response)
azul.terra.TerraStatusException: ('Unexpected response from https://data.terra.bio/api/repository/v1/snapshots?filter=hca_prod_2b38025da5ea4c0fb22e367824bcaf4c__20220111_dcp2_20220331_dcp15&limit=2', 502, b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head><script src="https://us.jsagent.tcell.insight.rapid7.com/tcellagent.min.js" tcellappid="FCProd-EBOgM" tcellapikey="AQQBBAFLGLOxL7VE9IF9ESlLvCxD5Ykr_7xkQKq_rgn_P58IWjOhOzIh6p3aI4pTWaprlUw" tcellbaseurl="https://us.agent.tcell.insight.rapid7.com/api/v1"></script>\n<title>502 Proxy Error</title>\n</head><body>\n<h1>Proxy Error</h1>\n<p>The proxy server received an invalid\r\nresponse from an upstream server.<br />\r\nThe proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>\n<hr>\n<address>Apache Server at data.terra.bio Port 80</address>\n</body></html>\n')

dsotirho-ucsc commented 2 years ago

@hannes-ucsc: "Consider retrying on 502s from TDR."
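
A minimal sketch of what retrying on 502s could look like, using only the standard library. The names `retry_on_502` and `send`, the attempt count, and the backoff values are hypothetical and not taken from Azul's Terra client. `send` stands in for whatever callable performs one request and returns the status and body.

```python
import time
from typing import Callable, Tuple

# Hypothetical response shape for this sketch: (HTTP status, body)
Response = Tuple[int, bytes]


def retry_on_502(send: Callable[[], Response],
                 attempts: int = 3,
                 backoff: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Response:
    """
    Call send() until it returns a status other than 502 or the attempts
    are used up. The last response is returned either way, so the caller
    can still raise on an unexpected status.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(attempts):
        status, body = send()
        if status != 502:
            break
        # Back off exponentially, but don't sleep after the final attempt
        if attempt < attempts - 1:
            sleep(backoff * 2 ** attempt)
    return status, body
```

Because the final response is returned unchanged, an existing status check such as the `_check_response` call seen in the traceback would still raise if every attempt came back with a 502.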

achave11-ucsc commented 2 years ago

This same failure was encountered in the post_deploy step in a GitLab build. https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/56588#L4193

achave11-ucsc commented 1 year ago

This happened again in anvilbox https://gitlab.anvil.gi.ucsc.edu/ucsc/azul/-/jobs/6505#L1248

achave11-ucsc commented 1 year ago

This exception, caused by a 502 response from Terra, triggered the azul-service_5xx-anvildev alarm.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/apigateway/azul-service-anvildev, /aws/lambda/azul-service-anvilbox
start-time: -10800s
end-time: 0s
query-string:

fields @timestamp, @message, @logStream, @log
| filter @message like /Unexpected response from https:\/\/data.terra.bio/
| sort @timestamp desc
| limit 1
[
    {
        "@timestamp": "2023-02-03 02:42:04.897",
        "@message": "
        Traceback (most recent call last):
        File "/var/task/azul/service/source_service.py", line 102, in _get
        result = response['Item']
        KeyError: 'Item'

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
        File "/var/task/azul/service/source_service.py", line 71, in list_sources
        sources = self._get(cache_key)
        File "/var/task/azul/service/source_service.py", line 104, in _get
        raise NotFound(key)
        azul.service.source_service.NotFound: Key not found: 'anvil:'

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
        File "/var/task/chalice/app.py", line 1913, in _get_view_function_response
        response = view_function(**function_args)
        File "/var/task/app.py", line 1215, in repository_search
        return app.repository_controller.search(catalog=app.catalog,
        File "/var/task/azul/service/repository_controller.py", line 78, in search
        filters = self.get_filters(catalog, authentication, filters)
        File "/var/task/azul/service/source_controller.py", line 69, in get_filters
        source_ids=self._list_source_ids(catalog, authentication))
        File "/var/task/azul/service/source_controller.py", line 60, in _list_source_ids
        sources = self.list_sources(catalog, authentication)
        File "/var/task/azul/service/source_controller.py", line 45, in list_sources
        sources = self._source_service.list_sources(catalog, authentication)
        File "/var/task/azul/service/source_service.py", line 73, in list_sources
        sources = list(plugin.list_sources(authentication))
        File "/var/task/azul/plugins/repository/tdr.py", line 106, in list_sources
        snapshots = tdr.snapshot_names_by_id(filter=filter)
        File "/var/task/azul/terra.py", line 608, in snapshot_names_by_id
        response = self._check_response(endpoint, response)
        File "/var/task/azul/terra.py", line 570, in _check_response
        raise TerraStatusException(endpoint, response)
        azul.terra.TerraStatusException: ('Unexpected response from https://data.terra.bio/api/repository/v1/snapshots?offset=0&limit=1000&sort=created_date&direction=asc&filter=ANVIL_', 502, b'\n<html><head>\n<meta http-equiv="content-type" content="text/html;charset=utf-8">\n<title>502 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n')
        ",
        "@logStream": "2023/02/03/[$LATEST]c82197830b2d4c3cb7368d507201d82a",
        "@log": "289950828509:/aws/lambda/azul-service-anvildev"
    }
]

achave11-ucsc commented 1 year ago

This exception, caused by a 502 response from Terra, also triggered the azul-service_5xx-prod alarm.

azul.terra.TerraStatusException: ('Unexpected response from https://data.terra.bio/api/repository/v1/snapshots?offset=0&limit=1000&sort=created_date&direction=asc&filter=hca_prod_', 502, b'\n<html><head>\n<meta http-equiv="content-type" content="text/html;charset=utf-8">\n<title>502 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n')

hannes-ucsc commented 1 year ago

Seems to have been an actual outage:

[image not preserved]

hannes-ucsc commented 1 year ago

The solution is to enable retries in Terra clients, but only during IT. This intersects with the time-boxing that we apply. The same condition that extends the timeout for Terra requests should be used to enable retries. This will also solve #5003 so the PR for this should be connected to both issues.
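
One way to read the proposal above is that a single condition should decide both the timeout and the retry count, so the two cannot drift apart. The sketch below illustrates that idea only. `TerraClientPolicy`, `terra_client_policy`, the `extended` flag, and all numeric values are hypothetical and do not reflect Azul's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TerraClientPolicy:
    # Hypothetical container for the two settings discussed above
    timeout: float  # seconds to wait for a Terra response
    retries: int  # number of additional attempts after a failure


def terra_client_policy(extended: bool) -> TerraClientPolicy:
    """
    Derive both settings from the same condition. `extended` stands in for
    whatever condition already extends the timeout for Terra requests,
    for example running under the integration test.
    """
    if extended:
        # Assumed values: long timeout, retries enabled
        return TerraClientPolicy(timeout=60.0, retries=3)
    else:
        # Assumed values: time-boxed request, fail fast without retries
        return TerraClientPolicy(timeout=5.0, retries=0)
```

Keeping retries at zero in the time-boxed case keeps the worst-case latency of a service request bounded by a single timeout.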

achave11-ucsc commented 1 year ago

This happened again, but during the deploy step in GitLab prod. https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/35596

achave11-ucsc commented 1 year ago

^^ retry failed again, with the same error. https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/35604#L933

hannes-ucsc commented 1 year ago

https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/64122

hannes-ucsc commented 1 year ago

https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/65424

achave11-ucsc commented 1 year ago

https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/66578