GNS-Science / nshm-toshi-api

An extensible API that retains task metadata and important input and output files relating to data-intensive science processes. Custom task schemas can be defined to support their metadata needs.
GNU Affero General Public License v3.0

FIX: Max retries exceeded on create_file_relation #181

Closed by chrisdicaprio 1 year ago

chrisdicaprio commented 1 year ago

When the problem occurs: when running disaggregations from runzi using run_oq_disagg.py, in BuilderTask._save_config() in oq_hazard_task.py.

Log:

INFO:__main__:sources: ['SW52ZXJzaW9uU29sdXRpb25Ocm1sOjEyMDk0NA==', 'RmlsZToxMzA3MTk=']
DEBUG:nshm_toshi_client.toshi_client_base:query: 
            mutation ($created: DateTime!, $source_models: [ID]!, $archive_id: ID!) {
              create_openquake_hazard_config(
                  input: {
                      created: $created
                      source_models: $source_models
                      template_archive: $archive_id
                  }
              )
              {
                ok
                config { id, created, source_models {
                  ... on Node { id } }
                }
              }
            }

DEBUG:nshm_toshi_client.toshi_client_base:variable_values: {'source_models': ['SW52ZXJzaW9uU29sdXRpb25Ocm1sOjEyMDk0NA==', 'RmlsZToxMzA3MTk='], 'archive_id': 'RmlsZToxMzY0MDY=', 'created': '2022-10-05T20:10:12.776531+00:00'}
DEBUG:nshm_toshi_client.toshi_client_base:query: 
        mutation create_file_relation(
            $thing_id:ID!
            $file_id:ID!
            $role:FileRole!) {
              create_file_relation(
                file_id:$file_id
                thing_id:$thing_id
                role:$role
              )
            {
              ok
            }
        }
DEBUG:nshm_toshi_client.toshi_client_base:variable_values: {'thing_id': 'T3BlbnF1YWtlSGF6YXJkQ29uZmlnOjE3Mzc4Ng==', 'file_id': 'RmlsZToxMzY0MDY=', 'role': 'READ'}
Traceback (most recent call last):
  File "/opt/openquake/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/opt/openquake/lib/python3.9/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/opt/openquake/lib/python3.9/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  File "/opt/openquake/lib/python3.9/site-packages/urllib3/connectionpool.py", line 878, in urlopen
    return self.urlopen(
  [Previous line repeated 3 more times]
  File "/opt/openquake/lib/python3.9/site-packages/urllib3/connectionpool.py", line 868, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/opt/openquake/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='aihssdkef5.execute-api.ap-southeast-2.amazonaws.com', port=443): Max retries exceeded with url: /prod/graphql (Caused by ResponseError('too many 504 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 551, in <module>
    task.run(**config)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 220, in run
    self.run_disaggregation(task_arguments, job_arguments, environment)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 263, in run_disaggregation
    config_id = self._save_config(archive_id, nrml_id_list)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 115, in _save_config
    self._toshi_api.openquake_hazard_config.create_archive_file_relation(
  File "/app/nzshm-runzi/runzi/automation/scaling/toshi_api/openquake_hazard/openquake_hazard_config.py", line 63, in create_archive_file_relation
    executed = self.api.run_query(qry, variables)
  File "/opt/openquake/lib/python3.9/site-packages/nshm_toshi_client/toshi_client_base.py", line 78, in run_query
    response = self._client.execute(gql_query, variable_values)
  File "/opt/openquake/lib/python3.9/site-packages/gql/client.py", line 403, in execute
    return self.execute_sync(
  File "/opt/openquake/lib/python3.9/site-packages/gql/client.py", line 221, in execute_sync
    return session.execute(
  File "/opt/openquake/lib/python3.9/site-packages/gql/client.py", line 849, in execute
    result = self._execute(
  File "/opt/openquake/lib/python3.9/site-packages/gql/client.py", line 758, in _execute
    result = self.transport.execute(
  File "/opt/openquake/lib/python3.9/site-packages/gql/transport/requests.py", line 220, in execute
    response = self.session.request(
  File "/opt/openquake/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/openquake/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/opt/openquake/lib/python3.9/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='aihssdkef5.execute-api.ap-southeast-2.amazonaws.com', port=443): Max retries exceeded with url: /prod/graphql (Caused by ResponseError('too many 504 error responses'))
chrisbc commented 1 year ago

I found the root cause of our problem with ToshiAPI (https://github.com/GNS-Science/nshm-toshi-api/issues/181), the one blocking your disaggs.

We've hit the hard limit on DynamoDB item size (400 KB) on one particular file object, RmlsZToxMzY0MDY=. That happens to be an openquake config.zip archive file that has been used more than 12,000 times (by openquake jobs), and each new usage adds an entry to the list of references in the object. That works until the capacity limit is hit, then... BOOM. NB we also see this message in the logs:

[INFO] backoff: Backing off create(...) for 15.6s (pynamodb.exceptions.TransactWriteError: Failed to write transaction items)
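
For anyone trying to work out which object an opaque ID in these logs refers to: the IDs appear to be base64-encoded `Type:number` strings. A minimal sketch of decoding one, using only the standard library (the helper name `decode_toshi_id` is made up for illustration and is not part of any client library):

```python
import base64
from typing import Tuple


def decode_toshi_id(encoded_id: str) -> Tuple[str, str]:
    """Split a base64-encoded ID into (type_name, local_id).

    Assumes the decoded text has the form "<Type>:<id>", which is what
    the IDs quoted in this thread look like once decoded.
    """
    decoded = base64.b64decode(encoded_id).decode("utf-8")
    type_name, _, local_id = decoded.partition(":")
    return type_name, local_id


# The file object that hit the size limit:
print(decode_toshi_id("RmlsZToxMzY0MDY="))  # ('File', '136406')
```

So the object in question is File 136406, the same ID passed as archive_id and file_id in the failing mutations above.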

So a very simple short-term workaround will be to save a new version of that openquake configuration archive and use that for future disagg openquake jobs. A proper fix requires a bit more thought, but at a minimum the error can be handled in a more elegant manner.
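
As one idea of what "more elegant" handling could look like on the client side, here is a rough sketch that wraps the relation call and re-raises transport failures with the IDs involved, so the log says which file and thing were affected. All names here (`FileRelationError`, `create_relation_with_context`, the `create_relation` callable) are hypothetical and not existing runzi or nshm-toshi-client code:

```python
class FileRelationError(RuntimeError):
    """Raised when a file relation cannot be created (hypothetical name)."""


def create_relation_with_context(create_relation, thing_id, file_id, role,
                                 transport_errors=(Exception,)):
    """Call create_relation and attach the IDs to any transport failure.

    create_relation: any callable taking (thing_id, file_id, role).
    transport_errors: exception types to translate; the caller chooses these.
    """
    try:
        return create_relation(thing_id, file_id, role)
    except transport_errors as exc:
        # Chain the original exception so the full traceback is still available.
        raise FileRelationError(
            f"create_file_relation failed for thing_id={thing_id!r}, "
            f"file_id={file_id!r}, role={role!r}: {exc}"
        ) from exc


# Demo with a stand-in callable that always fails, mimicking the 504s.
def always_times_out(thing_id, file_id, role):
    raise TimeoutError("too many 504 error responses")


try:
    create_relation_with_context(
        always_times_out, "thing-id", "file-id", "READ", (TimeoutError,)
    )
except FileRelationError as err:
    print(err)
```

In the real task, the exception type to pass would presumably be `requests.exceptions.RetryError`, since that is what the traceback above ends with. This only improves the error message; it does not address the underlying size limit.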

chrisbc commented 1 year ago

New API promoted to prod with much greater capacity for file_relations. NB this means the workaround described above is not needed now, at least until we have >80000 uses of one file :)

RETRY PROD jobs

DISAGGREGATION

CLONED DISAGG https://us-east-1.console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/bc671bbe-dad4-4cf9-be01-38789a35c95d

New job, https://us-east-1.console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/f8cb4b70-1447-4b6a-ba4b-ed12cca101ab

The job failed with a different error:

  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 551, in <module>
    task.run(**config)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 220, in run
    self.run_disaggregation(task_arguments, job_arguments, environment)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 345, in run_disaggregation
    solution_id = self._store_api_result(automation_task_id, ta_clean, oq_result, config_id,
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 173, in _store_api_result
    csv_archive_id, post_url = self._toshi_api.file.create_file(oq_result['csv_archive'])
KeyError: 'csv_archive'
executed {'create_file': {'ok': True, 'file_result': {'id': 'RmlsZToxNzEwNjM=', 'file_name': 'task_args.json', 'file_size': 923, 'md5_digest': 'dARjGn+divdmvDkyrxS3LQ==', 'post_url': '{"acl": "public-read", "Content-MD5": "dARjGn+divdmvDkyrxS3LQ==", "Content-Type": "binary/octet-stream", "key": "FileData/171063/task_args.json", "AWSAccessKeyId": "ASIAWW53A7TBJP52FFXR", "x-amz-security-token": "IQoJb3JpZ2luX2VjEND//////////wEaDmFwLXNvdXRoZWFzdC0yIkcwRQIhAJScRbmTGbnKmNR9nbwSQ7Ht54xH1Q61Yu7dpQEiD9qHAiALCNvCEjW1giDrtsyPpPjwjJvykRqaqHeIlmqKIf5InCqXAwiZ//////////8BEAIaDDQ2MTU2NDM0NTUzOCIMeoRXGRYKwbduDrX8KusCZN2aoaTZozQttkkStZ+/pn3md0jF0Lb8T3AdD+wDpxVR+YutCRv3YD5oh0cZQLpcijVfa8xkqywSHQdWcf4rg860C2cBiLhP+4TA4D4yRhiPc2alN0KfhhvORkFOpI9wpPfqjDnUN2b0egSJGezGvER/EP+HN29W+JYpaPulpCftylskZI6fbtsX85f8iAEX1MYJw8oylkKFSAxB1vpvn0ATZnoWBZFZMYEKjht+HqZC6jdqRqg5OaX6q0dLsScuZbSP5Y+wNy/eNTe9Ki0DO6Lpy+AbW3MIfW4gXouN3FyZFAeCedAv7/ylNr3Eq86EbCZFKEgKgWt4CCL594Wr7vrA942x+Y3wEMTP7qw5OmmlSKFMGlteM2CaLW/tXiz+Q1k1XYVdP4n1Fal82W6zHdwEyOw/+uDCUFX84g6HZO84XrLbnLmTSEslHGV4cKEHyp+tHmxt/7C2pE1IkDlsi+f88Iq64azl/GiRMJCpnZoGOp0BpJxDOhWyUEIusdyneD6WJBj4UcQAl44I3PaSx6jjYLctx1qvKpbogcQLK2arXFBGX7uM/kJZ0PC7AVsVrZqHNpWT4e0XfCOyYxZ9oLnpRaWCFd+YWiZIlUv/Hzoxb9/wLcHPPZ7vmy8Yp7smAH/A9cKWrdR9PIqNrper5aavZiLubN1vjdyn68kpnQU01RJnCL0NJx4GQ3Jbh0rJUA==", "policy": 
"eyJleHBpcmF0aW9uIjogIjIwMjItMTAtMTNUMDE6MDQ6NDlaIiwgImNvbmRpdGlvbnMiOiBbeyJhY2wiOiAicHVibGljLXJlYWQifSwgWyJzdGFydHMtd2l0aCIsICIkQ29udGVudC1UeXBlIiwgIiJdLCBbInN0YXJ0cy13aXRoIiwgIiRDb250ZW50LU1ENSIsICIiXSwgeyJidWNrZXQiOiAibnpzaG0yMi10b3NoaS1hcGktcHJvZCJ9LCB7ImtleSI6ICJGaWxlRGF0YS8xNzEwNjMvdGFza19hcmdzLmpzb24ifSwgeyJ4LWFtei1zZWN1cml0eS10b2tlbiI6ICJJUW9KYjNKcFoybHVYMlZqRU5ELy8vLy8vLy8vL3dFYURtRndMWE52ZFhSb1pXRnpkQzB5SWtjd1JRSWhBSlNjUmJtVEdibkttTlI5bmJ3U1E3SHQ1NHhIMVE2MVl1N2RwUUVpRDlxSEFpQUxDTnZDRWpXMWdpRHJ0c3lQcFBqd2pKdnlrUnFhcUhlSWxtcUtJZjVJbkNxWEF3aVovLy8vLy8vLy8vOEJFQUlhRERRMk1UVTJORE0wTlRVek9DSU1lb1JYR1JZS3diZHVEclg4S3VzQ1pOMmFvYVRab3pRdHRra1N0WisvcG4zbWQwakYwTGI4VDNBZEQrd0RweFZSK1l1dENSdjNZRDVvaDBjWlFMcGNpalZmYTh4a3F5d1NIUWRXY2Y0cmc4NjBDMmNCaUxoUCs0VEE0RDR5UmhpUGMyYWxOMEtmaGh2T1JrRk9wSTl3cFBmcWpEblVOMmIwZWdTSkdlekd2RVIvRVArSE4yOVcrSllwYVB1bHBDZnR5bHNrWkk2ZmJ0c1g4NWY4aUFFWDFNWUp3OG95bGtLRlNBeEIxdnB2bjBBVFpub1dCWkZaTVlFS2podCtIcVpDNmpkcVJxZzVPYVg2cTBkTHNTY3VaYlNQNVkrd055L2VOVGU5S2kwRE82THB5K0FiVzNNSWZXNGdYb3VOM0Z5WkZBZUNlZEF2Ny95bE5yM0VxODZFYkNaRktFZ0tnV3Q0Q0NMNTk0V3I3dnJBOTQyeCtZM3dFTVRQN3F3NU9tbWxTS0ZNR2x0ZU0yQ2FMVy90WGl6K1ExazFYWVZkUDRuMUZhbDgyVzZ6SGR3RXlPdy8rdURDVUZYODRnNkhaTzg0WHJMYm5MbVRTRXNsSEdWNGNLRUh5cCt0SG14dC83QzJwRTFJa0Rsc2krZjg4SXE2NGF6bC9HaVJNSkNwblpvR09wMEJwSnhET2hXeVVFSXVzZHluZUQ2V0pCajRVY1FBbDQ0STNQYVN4NmpqWUxjdHgxcXZLcGJvZ2NRTEsyYXJYRkJHWDd1TS9rSlowUEM3QVZzVnJacUhOcFdUNGUwWGZDT3lZeFo5b0xucFJhV0NGZCtZV2laSWxVdi9Iem94Yjkvd0xjSFBQWjd2bXk4WXA3c21BSC9BOWNLV3JkUjlQSXFOcnBlcjVhYXZaaUx1Yk4xdmpkeW42OGtwblFVMDFSSm5DTDBOSng0R1EzSmJoMHJKVUE9PSJ9XX0=", "signature": "j1egQjkPLowbFiNUeLs0LVLZMEE="}', 'meta': None}}}
chrisbc commented 1 year ago

HAZARD

CLONED https://us-east-1.console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/364bca3e-9b1c-4a2b-8978-8f65d70001db

new job: https://us-east-1.console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/7b148c65-012c-4758-b68c-23fc2cfca0f1

It's got past the failure point and is running oq-engine.