DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

CloudWatch dashboard in `anvilprod` and `hammerbox` omits two ES nodes #6271

Open hannes-ucsc opened 1 month ago

hannes-ucsc commented 1 month ago

[two images, not preserved]

hannes-ucsc commented 1 month ago

The template currently supports at most four nodes. Convert the dashboard JSON template to `.json.template.py` and use a loop up to `aws.es_instance_count` to generate the repetitive portions.
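
A minimal sketch of the loop idea, assuming the metric ID scheme that appears in the traceback further down (`m2`, `m4`, ... for primaries and `m3`, `m5`, ... for replicas). The function name `summed_expression` and the node count of six are hypothetical; the real template would read the count from `aws.es_instance_count`.

```python
def summed_expression(first_id: int, instance_count: int) -> str:
    """Join every second metric ID, starting at `first_id`, with ' + '."""
    return ' + '.join(f'm{first_id + i * 2}' for i in range(instance_count))


# Hypothetical node count, standing in for aws.es_instance_count.
instance_count = 6
primary = summed_expression(2, instance_count)
replica = summed_expression(3, instance_count)
print(primary)  # m2 + m4 + m6 + m8 + m10 + m12
print(replica)  # m3 + m5 + m7 + m9 + m11 + m13
```

Because the expression is derived from the count, adding nodes to the domain no longer requires editing the template by hand.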

hannes-ucsc commented 1 month ago

For the demo, attempt to reproduce the issue.

hannes-ucsc commented 1 week ago

I had previously destroyed tempdev and wanted to recreate it for an experiment:

$ make lambdas
…
$ cd terraform
…
$ make validate 
python /Users/hannes/workspace/hca/azul.hannes2.local/scripts/check_branch.py
set -o pipefail && git ls-files --ignored --others --directory --exclude-standard | (grep -v '/[^/]' || test $? -eq 1) | xargs -r rm -rv
__pycache__//cloudwatch_dashboard.template.json.cpython-311.pyc
__pycache__/
api_gateway.tf.json
authentication.tf.json
backend.tf.json
cloudwatch.tf.json
common.tf.json
dynamo.tf.json
lambda_layer.tf.json
plan.bin
plan.json
providers.tf.json
s3.tf.json
sqs.tf.json
step_function.tf.json
python providers.tf.json.template.py providers.tf.json
Creating providers.tf.json
python backend.tf.json.template.py backend.tf.json
Creating backend.tf.json
terraform init -reconfigure -lockfile=readonly

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Reusing previous version of hashicorp/null from the dependency lock file
- Reusing previous version of hashicorp/google from the dependency lock file
- Reusing previous version of hashicorp/aws from the dependency lock file
- Reusing previous version of opensearch-project/opensearch from the dependency lock file
- Reusing previous version of hashicorp/external from the dependency lock file
- Using previously-installed hashicorp/null v3.2.0
- Using previously-installed hashicorp/google v4.58.0
- Using previously-installed hashicorp/aws v5.49.0
- Using previously-installed opensearch-project/opensearch v2.2.1
- Using previously-installed hashicorp/external v2.2.0

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
python /Users/hannes/workspace/hca/azul.hannes2.local/scripts/terraform_schema.py check \
        || (echo "Schema is stale. Run 'make update_schema' and commit." ; false)
2024-06-24 18:24:50,265    INFO MainThread azul.terraform: Running ['terraform', 'version', '-json']
2024-06-24 18:24:50,311    INFO MainThread azul.terraform: Terraform output:
{
  "terraform_version": "1.6.6",
  "platform": "darwin_arm64",
  "provider_selections": {
    "registry.terraform.io/hashicorp/aws": "5.49.0",
    "registry.terraform.io/hashicorp/external": "2.2.0",
    "registry.terraform.io/hashicorp/google": "4.58.0",
    "registry.terraform.io/hashicorp/null": "3.2.0",
    "registry.terraform.io/opensearch-project/opensearch": "2.2.1"
  },
  "terraform_outdated": true
}

python api_gateway.tf.json.template.py api_gateway.tf.json
Creating api_gateway.tf.json
python authentication.tf.json.template.py authentication.tf.json
Creating authentication.tf.json
python cloudwatch.tf.json.template.py cloudwatch.tf.json
[INFO]  2024-06-24T18:24:55.250Z        00010ca1-b0ba-466f-8c58-dabbad000000    azul.deployment Allocated new Boto3 client for 'sts' with ID 4438014288
[INFO]  2024-06-24T18:24:55.393Z        00010ca1-b0ba-466f-8c58-dabbad000000    azul.deployment Allocated new Boto3 client for 'es' with ID 4440666448
Traceback (most recent call last):
  File "/Users/hannes/workspace/hca/azul.hannes2.local/terraform/cloudwatch.tf.json.template.py", line 334, in <module>
    'dashboard_body': dashboard_body()
                      ^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/terraform/cloudwatch.tf.json.template.py", line 20, in dashboard_body
    module = load_module(config.cloudwatch_dashboard_template,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/src/azul/modules.py", line 45, in load_module
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/hannes/workspace/hca/azul.hannes2.local/terraform/cloudwatch_dashboard.template.json.py", line 329, in <module>
    'expression': ' + '.join(f'm{2 + i * 2}' for i in range(aws.es_instance_count)),
                                                            ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/src/azul/deployment.py", line 230, in es_instance_count
    return self._es_domain_status['ElasticsearchClusterConfig']['InstanceCount']
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/src/azul/deployment.py", line 98, in wrapper
    return cached_func(self.boto3_session, self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/src/azul/deployment.py", line 95, in cached_func
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/src/azul/deployment.py", line 238, in _es_domain_status
    es_domain = self.es.describe_elasticsearch_domain(DomainName=config.es_domain)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/.venv/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hannes/workspace/hca/azul.hannes2.local/.venv/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the DescribeElasticsearchDomain operation: Domain not found: azul-index-tempdev
make: *** [cloudwatch.tf.json] Error 1

The above sequence of commands is part of `make deploy`, so I believe the same failure would occur there.
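
The traceback points to an ordering problem: generating `cloudwatch.tf.json` looks up the instance count of a domain that does not exist yet. Below is a minimal sketch of that failure mode. It uses stand-ins only; `StubAws`, `DomainNotFound` and `render_expression` are hypothetical names, not the real `azul.deployment` code, and a dict lookup takes the place of the AWS call.

```python
class DomainNotFound(Exception):
    """Stand-in for botocore's ResourceNotFoundException."""


class StubAws:
    """Hypothetical stand-in for the `aws` object used by the template."""

    def __init__(self, existing_domains: dict[str, int]):
        self._domains = existing_domains

    def es_instance_count(self, domain: str) -> int:
        # The real property queries AWS; here a dict lookup plays that role.
        try:
            return self._domains[domain]
        except KeyError:
            raise DomainNotFound(f'Domain not found: {domain}') from None


def render_expression(aws: StubAws, domain: str) -> str:
    # Mirrors the template line from the traceback that triggers the lookup.
    count = aws.es_instance_count(domain)
    return ' + '.join(f'm{2 + i * 2}' for i in range(count))


aws = StubAws(existing_domains={})  # fresh deployment: no domain yet
try:
    render_expression(aws, 'azul-index-tempdev')
except DomainNotFound as e:
    outcome = f'template generation failed: {e}'
else:
    outcome = 'template generated'
print(outcome)
```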

hannes-ucsc commented 1 week ago

This tided me over but may need more thought.

Index: terraform/cloudwatch_dashboard.template.json.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/terraform/cloudwatch_dashboard.template.json.py b/terraform/cloudwatch_dashboard.template.json.py
--- a/terraform/cloudwatch_dashboard.template.json.py   (revision de54dcb6841362843f0b90cf9bd5a251155db1a6)
+++ b/terraform/cloudwatch_dashboard.template.json.py   (date 1719382045387)
@@ -13,6 +13,12 @@
     aws,
 )

+es_instance_count = (
+    aws.es_instance_count
+    if config.share_es_domain else
+    config.es_instance_count
+)
+
 dashboard_body = {
     'widgets': [
         {
@@ -326,7 +332,7 @@
                 'metrics': [
                     [
                         {
-                            'expression': ' + '.join(f'm{2 + i * 2}' for i in range(aws.es_instance_count)),
+                            'expression': ' + '.join(f'm{2 + i * 2}' for i in range(es_instance_count)),
                             'label': 'Primary',
                             'id': 'e1',
                             'region': config.region,
@@ -335,7 +341,7 @@
                     ],
                     [
                         {
-                            'expression': ' + '.join(f'm{3 + i * 2}' for i in range(aws.es_instance_count)),
+                            'expression': ' + '.join(f'm{3 + i * 2}' for i in range(es_instance_count)),
                             'label': 'Replica',
                             'id': 'e2',
                             'region': config.region,
@@ -416,7 +422,7 @@
                                 }
                             ]
                         ]
-                        for i in range(1, aws.es_instance_count)
+                        for i in range(1, es_instance_count)
                     ))
                 ],
                 'view': 'timeSeries',
@@ -476,7 +482,7 @@
                             '.',
                             '.'
                         ]
-                        for i in range(1, aws.es_instance_count)
+                        for i in range(1, es_instance_count)
                     )
                 ],
                 'region': config.region,
@@ -495,7 +501,7 @@
                     [
                         {
                             'expression': 'DIFF(%s)/4/1000/60/5*100' %
-                                          '+'.join(f'm{i + 1}' for i in range(aws.es_instance_count)),
+                                          '+'.join(f'm{i + 1}' for i in range(es_instance_count)),
                             'label': 'Old generation',
                             'id': 'e2',
                             'region': config.region,
@@ -505,8 +511,8 @@
                     [
                         {
                             'expression': 'DIFF(%s)/4/1000/60/5*100' % '+'.join(
-                                f'm{i + aws.es_instance_count + 1}'
-                                for i in range(aws.es_instance_count)
+                                f'm{i + es_instance_count + 1}'
+                                for i in range(es_instance_count)
                             ),
                             'label': 'Young generation',
                             'id': 'e1',
@@ -540,7 +546,7 @@
                                 'visible': False
                             }
                         ]
-                        for i in range(1, aws.es_instance_count)
+                        for i in range(1, es_instance_count)
                     ),
                     [
                         '.',
@@ -552,7 +558,7 @@
                         '.',
                         '.',
                         {
-                            'id': f'm{aws.es_instance_count + 1}',
+                            'id': f'm{es_instance_count + 1}',
                             'visible': False
                         }
                     ],
@@ -563,11 +569,11 @@
                             '.',
                             '.',
                             {
-                                'id': f'm{i + aws.es_instance_count + 1}',
+                                'id': f'm{i + es_instance_count + 1}',
                                 'visible': False
                             }
                         ]
-                        for i in range(1, aws.es_instance_count)
+                        for i in range(1, es_instance_count)
                     )
                 ],
                 'view': 'timeSeries',
@@ -1742,4 +1748,3 @@
         }
     ]
 }
-

hannes-ucsc commented 3 days ago

A complete demo of the changes from PR #6369 would require destroying a shared deployment that has its own ES domain. Given that the PR uses the patch verbatim, and that patch worked for me, I don't think it will be necessary to demo those changes.
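
Short of a full demo, the conditional from the patch can be exercised in isolation. The sketch below is illustrative only and assumes that `config.share_es_domain` means the deployment uses a domain owned by another deployment. `StubConfig`, `select_instance_count` and `domain_missing` are hypothetical names, and the counts are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubConfig:
    """Hypothetical stand-in for Azul's `config` object."""
    share_es_domain: bool
    es_instance_count: int


def select_instance_count(config: StubConfig, query_aws: Callable[[], int]) -> int:
    # Same conditional as in the patch: only ask AWS when the domain is
    # shared, otherwise use the configured value.
    return query_aws() if config.share_es_domain else config.es_instance_count


def domain_missing() -> int:
    # Simulates the lookup failing because the domain does not exist yet.
    raise LookupError('Domain not found')


# Deployment with its own, not yet created domain: AWS is never queried.
own = select_instance_count(StubConfig(False, 2), domain_missing)
# Deployment sharing an existing domain: the count reported by AWS is used.
shared = select_instance_count(StubConfig(True, 2), lambda: 6)
print(own, shared)  # 2 6
```

The first call shows why the patch avoids the `ResourceNotFoundException`: with `share_es_domain` false, the failing lookup is never invoked.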