Closed chrisdicaprio closed 2 years ago
I found the root cause of our problem with ToshiAPI (https://github.com/GNS-Science/nshm-toshi-api/issues/181), the one blocking your disaggs.
We've hit the hard limit on DynamoDB item size (400 KB) on one particular file object (RmlsZToxMzY0MDY=).
That happens to be an openquake config.zip archive file that has been used > 12,000 times (openquake jobs), and each new usage adds an entry to the list of references in the object. That works until the capacity limit is hit, then... BOOM.
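As a rough back-of-the-envelope check, the numbers above imply how much space each reference entry takes. This is only a sketch under two assumptions not confirmed here: that the reference list accounts for nearly all of the item's size, and that the limit was hit at roughly 12,000 references.

```python
# Rough estimate only. Assumes the reference list dominates the item size
# and that the limit was reached at about 12,000 references.
DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # DynamoDB's 400 KB item size limit
REFERENCES_AT_FAILURE = 12_000          # approximate count quoted above


def bytes_per_reference(limit_bytes: int, n_references: int) -> float:
    """Average bytes each reference entry can use before the item is full."""
    return limit_bytes / n_references


avg = bytes_per_reference(DYNAMODB_ITEM_LIMIT_BYTES, REFERENCES_AT_FAILURE)
print(f"~{avg:.0f} bytes per reference entry")
```

So each usage only needs to add a few tens of bytes for a heavily reused file to reach the limit.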
NB we also see this message in logs:
[INFO] backoff: Backing off create(...) for 15.6s (pynamodb.exceptions.TransactWriteError: Failed to write transaction items)
So a very simple short-term workaround will be to save a new version of that openquake configuration archive and use that for future disagg openquake jobs. A proper fix requires a bit more thought, but as a minimum the error can be handled in a more elegant manner.
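One possible shape for "more elegant" handling is to check the size before writing and fail with a clear message instead of retrying. The sketch below is not the ToshiAPI implementation: the `relations` field name, the `FileRelationLimitError` class and the `add_reference` helper are all hypothetical, and the JSON length is only an approximation of the stored DynamoDB item size.

```python
import json

ITEM_LIMIT_BYTES = 400 * 1024  # DynamoDB item size limit


class FileRelationLimitError(Exception):
    """Hypothetical error: the file object cannot hold another reference."""


def add_reference(file_item: dict, reference: dict,
                  limit: int = ITEM_LIMIT_BYTES) -> dict:
    """Return a copy of file_item with reference appended, or raise clearly.

    'relations' is an assumed field name, and the JSON-encoded length is
    used as a stand-in for the real stored item size.
    """
    candidate = dict(file_item)
    candidate["relations"] = list(file_item.get("relations", [])) + [reference]
    size = len(json.dumps(candidate).encode("utf-8"))
    if size > limit:
        raise FileRelationLimitError(
            f"file {file_item.get('id')} would grow to ~{size} bytes "
            f"(limit {limit}); save a new copy of the file and use that instead"
        )
    return candidate


# Usage with placeholder IDs: the first call succeeds, the second hits
# an artificially small limit and raises a readable error.
item = add_reference({"id": "example-file-id", "relations": []},
                     {"task_id": "example-task-id"})
try:
    add_reference(item, {"task_id": "another-task-id"}, limit=50)
except FileRelationLimitError as err:
    message = str(err)
```

A check like this turns a long backoff loop into an immediate, actionable error that points at the workaround.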
New API promoted to prod with much greater capacity for file_relations. NB this means the workaround described above is not needed now, at least until we have >80000 uses of one file :)
CLONED DISAGG https://us-east-1.console.aws.amazon.com/batch/home?region=us-east-1#jobs/detail/bc671bbe-dad4-4cf9-be01-38789a35c95d
job failed with a different error:
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 551, in <module>
    task.run(**config)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 220, in run
    self.run_disaggregation(task_arguments, job_arguments, environment)
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 345, in run_disaggregation
    solution_id = self._store_api_result(automation_task_id, ta_clean, oq_result, config_id,
  File "/app/nzshm-runzi/runzi/execute/openquake/oq_hazard_task.py", line 173, in _store_api_result
    csv_archive_id, post_url = self._toshi_api.file.create_file(oq_result['csv_archive'])
KeyError: 'csv_archive'
executed {'create_file': {'ok': True, 'file_result': {'id': 'RmlsZToxNzEwNjM=', 'file_name': 'task_args.json', 'file_size': 923, 'md5_digest': 'dARjGn+divdmvDkyrxS3LQ==', 'post_url': '{"acl": "public-read", "Content-MD5": "dARjGn+divdmvDkyrxS3LQ==", "Content-Type": "binary/octet-stream", "key": "FileData/171063/task_args.json", "AWSAccessKeyId": "ASIAWW53A7TBJP52FFXR", "x-amz-security-token": "IQoJb3JpZ2luX2VjEND//////////wEaDmFwLXNvdXRoZWFzdC0yIkcwRQIhAJScRbmTGbnKmNR9nbwSQ7Ht54xH1Q61Yu7dpQEiD9qHAiALCNvCEjW1giDrtsyPpPjwjJvykRqaqHeIlmqKIf5InCqXAwiZ//////////8BEAIaDDQ2MTU2NDM0NTUzOCIMeoRXGRYKwbduDrX8KusCZN2aoaTZozQttkkStZ+/pn3md0jF0Lb8T3AdD+wDpxVR+YutCRv3YD5oh0cZQLpcijVfa8xkqywSHQdWcf4rg860C2cBiLhP+4TA4D4yRhiPc2alN0KfhhvORkFOpI9wpPfqjDnUN2b0egSJGezGvER/EP+HN29W+JYpaPulpCftylskZI6fbtsX85f8iAEX1MYJw8oylkKFSAxB1vpvn0ATZnoWBZFZMYEKjht+HqZC6jdqRqg5OaX6q0dLsScuZbSP5Y+wNy/eNTe9Ki0DO6Lpy+AbW3MIfW4gXouN3FyZFAeCedAv7/ylNr3Eq86EbCZFKEgKgWt4CCL594Wr7vrA942x+Y3wEMTP7qw5OmmlSKFMGlteM2CaLW/tXiz+Q1k1XYVdP4n1Fal82W6zHdwEyOw/+uDCUFX84g6HZO84XrLbnLmTSEslHGV4cKEHyp+tHmxt/7C2pE1IkDlsi+f88Iq64azl/GiRMJCpnZoGOp0BpJxDOhWyUEIusdyneD6WJBj4UcQAl44I3PaSx6jjYLctx1qvKpbogcQLK2arXFBGX7uM/kJZ0PC7AVsVrZqHNpWT4e0XfCOyYxZ9oLnpRaWCFd+YWiZIlUv/Hzoxb9/wLcHPPZ7vmy8Yp7smAH/A9cKWrdR9PIqNrper5aavZiLubN1vjdyn68kpnQU01RJnCL0NJx4GQ3Jbh0rJUA==", "policy": 
"eyJleHBpcmF0aW9uIjogIjIwMjItMTAtMTNUMDE6MDQ6NDlaIiwgImNvbmRpdGlvbnMiOiBbeyJhY2wiOiAicHVibGljLXJlYWQifSwgWyJzdGFydHMtd2l0aCIsICIkQ29udGVudC1UeXBlIiwgIiJdLCBbInN0YXJ0cy13aXRoIiwgIiRDb250ZW50LU1ENSIsICIiXSwgeyJidWNrZXQiOiAibnpzaG0yMi10b3NoaS1hcGktcHJvZCJ9LCB7ImtleSI6ICJGaWxlRGF0YS8xNzEwNjMvdGFza19hcmdzLmpzb24ifSwgeyJ4LWFtei1zZWN1cml0eS10b2tlbiI6ICJJUW9KYjNKcFoybHVYMlZqRU5ELy8vLy8vLy8vL3dFYURtRndMWE52ZFhSb1pXRnpkQzB5SWtjd1JRSWhBSlNjUmJtVEdibkttTlI5bmJ3U1E3SHQ1NHhIMVE2MVl1N2RwUUVpRDlxSEFpQUxDTnZDRWpXMWdpRHJ0c3lQcFBqd2pKdnlrUnFhcUhlSWxtcUtJZjVJbkNxWEF3aVovLy8vLy8vLy8vOEJFQUlhRERRMk1UVTJORE0wTlRVek9DSU1lb1JYR1JZS3diZHVEclg4S3VzQ1pOMmFvYVRab3pRdHRra1N0WisvcG4zbWQwakYwTGI4VDNBZEQrd0RweFZSK1l1dENSdjNZRDVvaDBjWlFMcGNpalZmYTh4a3F5d1NIUWRXY2Y0cmc4NjBDMmNCaUxoUCs0VEE0RDR5UmhpUGMyYWxOMEtmaGh2T1JrRk9wSTl3cFBmcWpEblVOMmIwZWdTSkdlekd2RVIvRVArSE4yOVcrSllwYVB1bHBDZnR5bHNrWkk2ZmJ0c1g4NWY4aUFFWDFNWUp3OG95bGtLRlNBeEIxdnB2bjBBVFpub1dCWkZaTVlFS2podCtIcVpDNmpkcVJxZzVPYVg2cTBkTHNTY3VaYlNQNVkrd055L2VOVGU5S2kwRE82THB5K0FiVzNNSWZXNGdYb3VOM0Z5WkZBZUNlZEF2Ny95bE5yM0VxODZFYkNaRktFZ0tnV3Q0Q0NMNTk0V3I3dnJBOTQyeCtZM3dFTVRQN3F3NU9tbWxTS0ZNR2x0ZU0yQ2FMVy90WGl6K1ExazFYWVZkUDRuMUZhbDgyVzZ6SGR3RXlPdy8rdURDVUZYODRnNkhaTzg0WHJMYm5MbVRTRXNsSEdWNGNLRUh5cCt0SG14dC83QzJwRTFJa0Rsc2krZjg4SXE2NGF6bC9HaVJNSkNwblpvR09wMEJwSnhET2hXeVVFSXVzZHluZUQ2V0pCajRVY1FBbDQ0STNQYVN4NmpqWUxjdHgxcXZLcGJvZ2NRTEsyYXJYRkJHWDd1TS9rSlowUEM3QVZzVnJacUhOcFdUNGUwWGZDT3lZeFo5b0xucFJhV0NGZCtZV2laSWxVdi9Iem94Yjkvd0xjSFBQWjd2bXk4WXA3c21BSC9BOWNLV3JkUjlQSXFOcnBlcjVhYXZaaUx1Yk4xdmpkeW42OGtwblFVMDFSSm5DTDBOSng0R1EzSmJoMHJKVUE9PSJ9XX0=", "signature": "j1egQjkPLowbFiNUeLs0LVLZMEE="}', 'meta': None}}}
it's got past the failure point and is running oq-engine.
When does the problem occur: when running disaggregations from runzi using
run_oq_disagg.py
in oq_hazard_task.py
BuilderTask._save_config()
Log: