DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Race between layer and Lambda update #5927

Open achave11-ucsc opened 7 months ago

achave11-ucsc commented 7 months ago

The example below is from anvildev, however the same errors occurred on dev and anvilprod.

During the deploy job for the merge commit of PR #5909 (which updated the elasticsearch client from 7.10.1 to 7.17.9), an UnsupportedProductError occurred for both the indexercachehealth and servicecachehealth lambdas.

CloudWatch Insights logs:

[ERROR] UnsupportedProductError: The client noticed that the server is not a supported distribution of Elasticsearch
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1756, in __call__
    return self.handler(event_obj)
  File "/var/task/app.py", line 212, in update_health_cache
    app.health_controller.update_cache()
  File "/var/task/azul/health.py", line 138, in update_cache
    health_object = dict(time=time.time(), health=self._health.as_json_fast())
  File "/var/task/azul/health.py", line 308, in as_json_fast
    return self.as_json(p.key for p in self.fast_properties[self.lambda_name])
  File "/var/task/azul/health.py", line 181, in as_json
    json = {k: getattr(self, k) for k in sorted(keys)}
  File "/var/task/azul/health.py", line 181, in <dictcomp>
    json = {k: getattr(self, k) for k in sorted(keys)}
  File "/var/task/azul/health.py", line 73, in __get__
    return super().__get__(obj, objtype=objtype)
  File "/var/task/azul/caching.py", line 189, in __get__
    value = obj.__dict__[self.fget.__name__] = self.fget(obj)
  File "/var/task/azul/health.py", line 273, in elasticsearch
    'up': ESClientFactory.get().ping(),
  File "/opt/python/elasticsearch/client/utils.py", line 347, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/opt/python/elasticsearch/client/__init__.py", line 280, in ping
    return self.transport.perform_request(
  File "/opt/python/elasticsearch/transport.py", line 421, in perform_request
    _ProductChecker.raise_error(self._verified_elasticsearch)
  File "/opt/python/elasticsearch/transport.py", line 638, in raise_error
    raise UnsupportedProductError(message)

achave11-ucsc commented 7 months ago

Assignee to populate description with symptoms.

achave11-ucsc commented 7 months ago

Assignee to consider next steps.

hannes-ucsc commented 6 months ago

Turns out that the function code and the layer are updated with different API actions:

[image: screenshot of a spreadsheet of CloudWatch Insights results]

The screenshot is of a spreadsheet that was imported from CloudWatch Insights results and massaged (the @timestamp of trail events is not the event time). The query used was:

fields eventTime, @timestamp, eventName, requestParameters.functionName, @message
| filter @message like /ERROR|Task|INIT_START/ or eventSource = 'lambda.amazonaws.com'
| limit 1000

This shows that the errors occurred after the layer was updated with UpdateFunctionConfiguration20150331v2 and before the function code was updated with UpdateFunctionCode20150331v2. During that time, the new ES client library was used by the old code that didn't contain the monkey patch for disabling the server version check in the ES client library.
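
The race described above can be illustrated with a small, self-contained model. This is only a sketch: the `LambdaFunction` class, its `invoke` method and the `deploy` helper are hypothetical names invented for illustration and are not part of Azul, Terraform or the AWS API. The model assumes that only the new function code carries the monkey patch, and that only the new layer contains the client library that performs the server check.

```python
from dataclasses import dataclass


@dataclass
class LambdaFunction:
    """Hypothetical model: code and layer are updated by separate API actions."""
    code: str   # 'old' or 'new' function code
    layer: str  # 'old' or 'new' dependencies layer

    def invoke(self) -> str:
        # Assumption: only the new code contains the monkey patch that
        # disables the server check in the new ES client library.
        patched = self.code == 'new'
        if self.layer == 'new' and not patched:
            return 'UnsupportedProductError'
        return 'ok'


def deploy(fn: LambdaFunction) -> list[str]:
    """Apply both updates in the order described above, invoking after each."""
    outcomes = [fn.invoke()]
    fn.layer = 'new'  # stands in for UpdateFunctionConfiguration
    outcomes.append(fn.invoke())
    fn.code = 'new'   # stands in for UpdateFunctionCode
    outcomes.append(fn.invoke())
    return outcomes


print(deploy(LambdaFunction(code='old', layer='old')))
```

In this model, an invocation that lands between the two updates fails, while invocations before and after succeed. That matches the symptom of scheduled health cache lambdas failing only during the deploy.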

hannes-ucsc commented 6 months ago

Spike to try publish. Use the trail to show the API actions used by Terraform. It should include a call to PublishVersion after UpdateFunctionConfiguration and UpdateFunctionCode.

dsotirho-ucsc commented 4 months ago

> Spike to try publish. Use the trail to show the API actions used by Terraform. It should include a call to PublishVersion after UpdateFunctionConfiguration and UpdateFunctionCode.

The PublishVersion action occurred after UpdateFunctionConfiguration and after (or at the same time as) UpdateFunctionCode.

Index: src/azul/terraform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/terraform.py b/src/azul/terraform.py
--- a/src/azul/terraform.py (revision 79a96535d237122c76dbec6fd50932f24d2b3cf4)
+++ b/src/azul/terraform.py (date 1713459301678)
@@ -708,6 +708,7 @@
         for resource in resources['aws_lambda_function'].values():
             assert 'layers' not in resource
             resource['layers'] = ['${aws_lambda_layer_version.dependencies.arn}']
+            resource['publish'] = True
             env = config.es_endpoint_env(
                 es_endpoint=(
                     aws.es_endpoint

# log-group-names: azul-trail-dev

  fields @timestamp, eventType, eventName, requestParameters.functionName
| filter @message like /PublishVersion|UpdateFunctionConfiguration|UpdateFunctionCode/
| filter userIdentity.arn like /dsotirho/
| filter eventName != 'StartQuery'
| sort @timestamp asc
| limit 1000

@timestamp eventType eventName requestParameters.functionName
2024-04-18 16:45:01.710 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-indexercachehealth
2024-04-18 16:45:41.463 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-aggregate
2024-04-18 16:46:26.276 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-contribute_retry
2024-04-18 16:46:26.277 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel
2024-04-18 16:46:26.277 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel
2024-04-18 16:46:26.278 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel-manifest
2024-04-18 16:46:26.278 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-contribute
2024-04-18 16:46:26.279 AwsApiCall PublishVersion20150331 azul-indexer-daniel
2024-04-18 16:46:26.279 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel
2024-04-18 16:46:26.280 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel-servicecachehealth
2024-04-18 16:47:11.819 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-service-daniel-servicecachehealth
2024-04-18 16:47:11.820 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-aggregate_retry
2024-04-18 16:47:11.823 AwsApiCall PublishVersion20150331 azul-service-daniel-servicecachehealth
2024-04-18 16:47:51.553 AwsApiCall PublishVersion20150331 azul-indexer-daniel-aggregate_retry
2024-04-18 16:48:36.308 AwsApiCall PublishVersion20150331 azul-service-daniel-manifest
2024-04-18 16:48:36.309 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel-manifest
2024-04-18 16:48:36.315 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-aggregate_retry
2024-04-18 16:48:36.316 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-contribute
2024-04-18 16:48:36.316 AwsApiCall PublishVersion20150331 azul-indexer-daniel-contribute
2024-04-18 16:48:36.320 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-contribute_retry
2024-04-18 16:50:01.636 AwsApiCall PublishVersion20150331 azul-indexer-daniel-aggregate
2024-04-18 16:50:01.637 AwsApiCall PublishVersion20150331 azul-indexer-daniel-contribute_retry
2024-04-18 16:50:01.637 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:25.032 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-aggregate
2024-04-18 16:50:46.957 AwsApiCall PublishVersion20150331 azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:46.957 AwsApiCall PublishVersion20150331 azul-service-daniel
2024-04-18 16:50:46.958 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-service-daniel
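
To read the per-function ordering off results like the ones above, the rows can be grouped by function name. The following is a rough sketch, not part of Azul: `actions_by_function` is a hypothetical helper, and it assumes whitespace-separated rows in the column order shown above. It strips the ARN prefix that UpdateFunctionCode events carry and the API version suffix from the event name. Since @timestamp is not the event time, the resulting order is only approximate.

```python
import re
from collections import defaultdict


def actions_by_function(rows: str) -> dict[str, list[str]]:
    """Group API actions by function name, ordered by @timestamp."""
    parsed = []
    for line in rows.strip().splitlines():
        date, time, _event_type, event_name, function = line.split()
        # UpdateFunctionCode events name the function by ARN; keep the last part.
        name = function.rsplit(':', 1)[-1]
        # Drop the API version suffix, e.g. '20150331v2' or '20150331'.
        action = re.sub(r'\d{8}(v\d+)?$', '', event_name)
        parsed.append((f'{date} {time}', name, action))
    parsed.sort(key=lambda row: row[0])  # stable sort, timestamp only
    order = defaultdict(list)
    for _, name, action in parsed:
        order[name].append(action)
    return dict(order)


# Rows copied from the results above for one of the functions.
SAMPLE = '''
2024-04-18 16:45:01.710 AwsApiCall UpdateFunctionConfiguration20150331v2 azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:01.637 AwsApiCall UpdateFunctionCode20150331v2 arn:aws:lambda:us-east-1:122796619775:function:azul-indexer-daniel-indexercachehealth
2024-04-18 16:50:46.957 AwsApiCall PublishVersion20150331 azul-indexer-daniel-indexercachehealth
'''

for name, actions in actions_by_function(SAMPLE).items():
    print(name, actions)
```

For the sampled function, this yields UpdateFunctionConfiguration, then UpdateFunctionCode, then PublishVersion.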

achave11-ucsc commented 4 months ago

Assignee to consider next steps.

hannes-ucsc commented 4 months ago

Assignee to move forward with publish.