I was expecting to see up to 16 jobs running on each node. However, only 10 were running concurrently. The frontend's cluster view was also reporting that all 32 CPUs were in use. @sanderegg is that to be expected?
Otherwise, everything seems to work fine from a computational backend point of view (I did not download the data, though). However, the api-server gets exponentially slower when submitting 256 jobs at once, and at some point it stops adding jobs. @sanderegg I think you did not see that with the sleeper, right? The only difference here is that I upload a 200 kB input file with each solver job. @pcrespov
large Antonino LF simulation
This is an input file of size 3 GB.
It already fails when I try to use a pipeline (FilePicker->iSolve). It fails after a while (after downloading about half of the input file):
Fri Aug 25 2023 10:19:11 GMT+0200 (Central European Summer Time) DEBUG eaf48c93-5b61-5840-95f5-0dc79364c39b isolve-mpi edge: [sidecar] Downloading 'staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eee881a4-fe82-4acf-ae2a-6d8703d1b945/0c35fda1-639a-4ff4-9ada-0eb44c7f681f_Input.h5': 51.6% (1.4GiB / 2.8GiB) [95.80 MBytes/s (avg)]
Fri Aug 25 2023 10:19:27 GMT+0200 (Central European Summer Time) ERROR eaf48c93-5b61-5840-95f5-0dc79364c39b isolve-mpi edge: The dask computational backend does not know about the task 'simcore/services/comp/isolve-mpi:2.1.21:userid_2:projectid_68c40442-4313-11ee-bfc8-02420a0b9f10:nodeid_eaf48c93-5b61-5840-95f5-0dc79364c39b:uuid_b376f8a7-d8c9-49a8-b88f-6e201be1166c'
The second time, the download went through. However, at the end I get this:
Fri Aug 25 2023 11:25:15 GMT+0200 (Central European Summer Time) DEBUG eaf48c93-5b61-5840-95f5-0dc79364c39b isolve-mpi edge: [sidecar] Uploading 'staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5': 0.1% (10.0MiB / 8.5GiB)
Fri Aug 25 2023 11:25:17 GMT+0200 (Central European Summer Time) ERROR eaf48c93-5b61-5840-95f5-0dc79364c39b isolve-mpi edge: [sidecar] Task error:
400, message='Bad Request', url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476')
Fri Aug 25 2023 11:25:17 GMT+0200 (Central European Summer Time) INFO eaf48c93-5b61-5840-95f5-0dc79364c39b isolve-mpi edge: [sidecar] TIP: There might be more information in the service log file in the service outputs
Fri Aug 25 2023 11:25:22 GMT+0200 (Central European Summer Time) ERROR eaf48c93-5b61-5840-95f5-0dc79364c39b isolve-mpi edge: ClientResponseError(RequestInfo(url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476'), method='PUT', headers=<CIMultiDictProxy('Host': 's3.amazonaws.com', 'Content-Length': '9168212760', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Python/3.10 aiohttp/3.8.5', 'Content-Type': 'application/octet-stream')>, real_url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476')), (), status=400, message='Bad Request', headers=<CIMultiDictProxy('x-amz-request-id': 'SFBQVKR3V73RXRFZ', 'x-amz-id-2': 'CFOXVtj7HSYkwPKIA9Fg0/DVJWmdCTR+Y6m+S14Qec0IRLeTarM2WkSo8+3BVke+Bx4/gnZ7g/4=', 'Content-Type': 'application/xml', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 25 Aug 2023 09:25:14 GMT', 'Server': 'AmazonS3', 'Connection': 'close')>)
sidecar:
log_level=INFO | log_timestamp=2023-08-25 09:25:17,035 | log_source=simcore_service_dask_sidecar.computational_sidecar.core:_publish_sidecar_log(175) | log_uid=None | log_msg=TIP: There might be more information in the service log file in the service outputs
log_level=ERROR | log_timestamp=2023-08-25 09:25:17,036 | log_source=distributed.protocol.pickle:dumps(83) | log_uid=None | log_msg=Failed to serialize 400, message='Bad Request', url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476').\nTraceback (most recent call last):\n File "/home/scu/.venv/lib/python3.10/site-packages/distributed/worker.py", line 3103, in apply_function_simple\n result = function(*args, **kwargs)\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_director_v2/modules/dask_client.py", line 220, in _comp_sidecar_fct\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dask_sidecar/tasks.py", line 163, in run_computational_sidecar\n return asyncio.get_event_loop().run_until_complete(\n File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n return future.result()\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dask_sidecar/tasks.py", line 133, in _run_computational_sidecar_async\n output_data = await sidecar.run(command=command)\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dask_sidecar/computational_sidecar/core.py", line 255, in run\n results = await self._retrieve_output_data(\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dask_sidecar/computational_sidecar/core.py", line 155, in _retrieve_output_data\n await asyncio.gather(*upload_tasks)\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dask_sidecar/file_utils.py", line 263, in push_file_to_remote\n await _push_file_to_http_link(file_to_upload, dst_url, log_publishing_cb)\n File 
"/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dask_sidecar/file_utils.py", line 179, in _push_file_to_http_link\n await fs._put_file( # pylint: disable=protected-access # noqa: SLF001\n File "/home/scu/.venv/lib/python3.10/site-packages/fsspec/implementations/http.py", line 308, in _put_file\n self._raise_not_found_for_status(resp, rpath)\n File "/home/scu/.venv/lib/python3.10/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status\n response.raise_for_status()\n File "/home/scu/.venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status\n raise ClientResponseError(\naiohttp.client_exceptions.ClientResponseError: 400, message='Bad Request', url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476')\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/scu/.venv/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 63, in dumps\n result = pickle.dumps(x, **dump_kwargs)\nTypeError: can't pickle multidict._multidict.CIMultiDictProxy objects\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/scu/.venv/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 68, in dumps\n pickler.dump(x)\nTypeError: can't pickle multidict._multidict.CIMultiDictProxy objects\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/scu/.venv/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 81, in dumps\n result = cloudpickle.dumps(x, 
**dump_kwargs)\n File "/home/scu/.venv/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps\n cp.dump(obj)\n File "/home/scu/.venv/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump\n return Pickler.dump(self, obj)\nTypeError: can't pickle multidict._multidict.CIMultiDictProxy objects
log_level=WARNING | log_timestamp=2023-08-25 09:25:17,044 | log_source=distributed.worker:execute(2345) | log_uid=None | log_msg=Compute Failed\nKey: simcore/services/comp/isolve-mpi:2.1.21:userid_2:projectid_68c40442-4313-11ee-bfc8-02420a0b9f10:nodeid_eaf48c93-5b61-5840-95f5-0dc79364c39b:uuid_1d0eaad3-9143-4cbb-904e-2a163b8d10ed\nFunction: _comp_sidecar_fct\nargs: ()\nkwargs: {'docker_auth': DockerBasicAuth(server_address='registry.staging.osparc.io', username='admin', password=SecretStr('**********')), 'service_key': 'simcore/services/comp/isolve-mpi', 'service_version': '2.1.21', 'input_data': TaskInputData(__root__={'input_1': FileUrl(url=AnyUrl('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eee881a4-fe82-4acf-ae2a-6d8703d1b945/0c35fda1-639a-4ff4-9ada-0eb44c7f681f_Input.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=5b6042e95647a4ea785ec091d395d51a5254e52ad33efce3d985eaf54ee7f8fa', scheme='https', host='s3.amazonaws.com', tld='com', host_type='domain', path='/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eee881a4-fe82-4acf-ae2a-6d8703d1b945/0c35fda1-639a-4ff4-9ada-0eb44c7f681f_Input.h5', query='X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/202\nException: "ClientResponseError(RequestInfo(url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476'), method='PUT', headers=<CIMultiDictProxy('Host': 's3.amazonaws.com', 'Content-Length': '9168212760', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 
'Python/3.10 aiohttp/3.8.5', 'Content-Type': 'application/octet-stream')>, real_url=URL('https://s3.amazonaws.com/staging-simcore/68c40442-4313-11ee-bfc8-02420a0b9f10/eaf48c93-5b61-5840-95f5-0dc79364c39b/output.h5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIATHZO54NSKBIOBR5W/20230825/us-east-1/s3/aws4_request&X-Amz-Date=20230825T085331Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=181f40f8a0f7e1cc41db7f6a1c68149fe83b219d09485cfcb286169913d8a476')), (), status=400, message='Bad Request', headers=<CIMultiDictProxy('x-amz-request-id': 'SFBQVKR3V73RXRFZ', 'x-amz-id-2': 'CFOXVtj7HSYkwPKIA9Fg0/DVJWmdCTR+Y6m+S14Qec0IRLeTarM2WkSo8+3BVke+Bx4/gnZ7g/4=', 'Content-Type': 'application/xml', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 25 Aug 2023 09:25:14 GMT', 'Server': 'AmazonS3', 'Connection': 'close')>)"\n
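Side note on the sidecar log above: besides the failed upload, the worker also logs `Failed to serialize ... TypeError: can't pickle multidict._multidict.CIMultiDictProxy objects`, i.e. the original exception cannot be sent back through dask as-is. A minimal sketch of one way to deal with that is to convert a non-picklable exception into a plain one before it leaves the task. Everything here is hypothetical and not taken from the sidecar code: `PicklableTaskError`, `to_picklable`, `run_task_safely` and the stand-in `FakeResponseError` (which uses a `threading.Lock` to mimic a non-picklable headers object) are illustration names only.

```python
import pickle
import threading


class FakeResponseError(Exception):
    """Hypothetical stand-in for an HTTP error that carries non-picklable state."""

    def __init__(self, status, message, headers):
        super().__init__(status, message, headers)
        self.status = status
        self.message = message
        self.headers = headers


class PicklableTaskError(RuntimeError):
    """Hypothetical plain error that only carries a string, so it always pickles."""


def to_picklable(exc):
    """Return exc unchanged if it pickles, else a string-only replacement."""
    try:
        pickle.dumps(exc)
        return exc
    except Exception:  # pickling can fail with TypeError, PicklingError, ...
        status = getattr(exc, "status", None)
        message = getattr(exc, "message", None)
        return PicklableTaskError(
            f"{type(exc).__name__}: status={status}, message={message!r}"
        )


def run_task_safely(fn, *args, **kwargs):
    """Run fn and re-raise any failure as an exception that can be pickled."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        raise to_picklable(exc) from None


def failing_upload():
    # A lock cannot be pickled, similar in spirit to the headers proxy in the log.
    raise FakeResponseError(400, "Bad Request", threading.Lock())
```

With something like this, the scheduler side would receive a readable `PicklableTaskError` instead of a secondary serialization failure, at the cost of losing the original exception type.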
Is the reason most probably the size of the output file (8.5 GB), @sanderegg?
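To put numbers on that suspicion: the failing request in the log is a single `PUT` with `Content-Length: 9168212760`, and S3 documents a 5 GB maximum for a single PUT, with larger objects requiring multipart upload (parts of 5 MiB to 5 GiB, at most 10,000 parts). A 400 on this request would be consistent with that limit. A rough sketch of the check follows; `plan_upload` and the 100 MiB default part size are my own assumptions for illustration, and the 5 GB limit is taken as 5 GiB here.

```python
import math

S3_MAX_SINGLE_PUT = 5 * 1024**3  # single PUT limit, taken as 5 GiB
S3_MIN_PART_SIZE = 5 * 1024**2   # minimum multipart part size (except last part)
S3_MAX_PARTS = 10_000            # maximum number of parts per multipart upload


def plan_upload(file_size, part_size=100 * 1024**2):
    """Return (method, number_of_requests, part_size) for a file of file_size bytes.

    Hypothetical helper: it only does the arithmetic, it does not talk to S3.
    """
    if file_size <= S3_MAX_SINGLE_PUT:
        return ("single_put", 1, file_size)
    # Grow the part size if needed so the upload fits into the part-count limit.
    part_size = max(part_size, S3_MIN_PART_SIZE, math.ceil(file_size / S3_MAX_PARTS))
    return ("multipart", math.ceil(file_size / part_size), part_size)


# Content-Length of the failing PUT from the log above
print(plan_upload(9168212760))  # ('multipart', 88, 104857600)
```

The 2.8 GiB input file from the first attempt stays below the single PUT limit, which would explain why only the output upload hits this.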
Concerning the number of jobs per machine, I think I have seen something similar with the sleepers as well. There is something to dig into there; I am not sure yet what is going on.
For the second part: no, I did not observe that. I was mostly checking the CPU/RAM usage of the different services (checking whether they were somehow capping at 100%), so I might have missed it. Maybe @pcrespov, @bisgaard-itis we could design a benchmark test on the api-server to test that in isolation?
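As a starting point for such a benchmark, here is a minimal sketch that submits N jobs concurrently and records the latency of each submission, so a slowdown over the course of the run becomes visible. It deliberately does not use the real api-server client: `fake_submit_job` is a hypothetical stand-in that would have to be replaced by the actual "create and start a solver job" call, and the concurrency and window sizes are arbitrary.

```python
import asyncio
import statistics
import time


async def fake_submit_job(index):
    """Hypothetical stand-in for one job submission against the api-server."""
    await asyncio.sleep(0.001)
    return index


async def benchmark_submissions(submit, n_jobs=256, concurrency=16):
    """Submit n_jobs with bounded concurrency, return per-job latency in seconds."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies = [0.0] * n_jobs

    async def _one(i):
        async with semaphore:
            start = time.perf_counter()
            await submit(i)
            latencies[i] = time.perf_counter() - start

    await asyncio.gather(*(_one(i) for i in range(n_jobs)))
    return latencies


def summarize(latencies, window=32):
    """Mean latency per consecutive window of submissions.

    Values that grow from window to window would point to the slowdown
    described above.
    """
    return [
        statistics.mean(latencies[i : i + window])
        for i in range(0, len(latencies), window)
    ]


if __name__ == "__main__":
    result = asyncio.run(benchmark_submissions(fake_submit_job))
    for n, mean_latency in enumerate(summarize(result)):
        print(f"jobs {n * 32:3d}-{n * 32 + 31:3d}: {mean_latency * 1000:.2f} ms")
```

Running it once with and once without the 200 kB input upload per job would show whether the upload is what makes the difference compared to the sleeper runs.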
All tests performed on personal cluster in aws-staging:
small parallel plates LF simulation
setup
r6a.8xlarge (252 GB RAM, 32 CPUs each)
results