aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Make `safe_interval` more dynamic for quick transport tasks #6544

Open · GeigerJ2 opened 2 months ago

GeigerJ2 commented 2 months ago

As realized together with @giovannipizzi while debugging things for our new cluster at PSI: when submitting a simple calculation (execution takes about 10s) for testing purposes with the default `safe_interval=30` in the Computer configuration, one has to wait an additional 90s until the job is done (30s each for the upload, submit, and retrieve tasks). This is to be expected, of course, and one could simply reduce the `safe_interval` (albeit at an increased risk of overloading SSH).
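(For reference, a minimal sketch of how the current value can be inspected, and lowered, from the Python API; the computer label is a placeholder, and the last step assumes `safe_interval` is kept among the stored auth parameters, as it is for `core.ssh`:)

```python
from aiida import load_profile, orm

load_profile()

# 'my_cluster' is a placeholder label for the configured computer.
computer = orm.load_computer('my_cluster')
authinfo = computer.get_authinfo(orm.User.collection.get_default())

# The cooldown used by the transport queue for this computer/user pair:
print(authinfo.get_transport().get_safe_open_interval())  # 30.0 by default

# Assuming safe_interval is stored among the auth parameters (as for core.ssh),
# it can be lowered, at the cost of more frequent SSH connections:
auth_params = authinfo.get_auth_params()
auth_params['safe_interval'] = 5.0
authinfo.set_auth_params(auth_params)
```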

However, in that case the upload task is actually the first transport task executed by the daemon worker, so it could, in principle, start immediately (the same holds if jobs were run previously, but longer ago than the `safe_interval`). With input from @giovannipizzi, I implemented a first version locally that does this by adding a `last_close_time` attribute (stored in the AuthInfo metadata for a first PoC). In the `request_transport` method of the `TransportQueue`, the difference between the current time and `last_close_time` is checked, and if it is larger than `safe_interval`, the transport is opened immediately via:

```python
open_callback_handle = self._loop.call_later(0, do_open, context=contextvars.Context())  # or use 1 for safety?
```

bypassing the `safe_interval` (or `safe_open_interval`, as it is called in `transports.py`).
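A minimal sketch of that check, under the assumptions of the PoC above (i.e. `last_close_time` is written to the AuthInfo metadata whenever a transport is closed; the `TransportQueue` internals are simplified here and the helper name is hypothetical):

```python
import contextvars
import time


def schedule_open(loop, authinfo, do_open):
    """Schedule ``do_open`` on the event loop, skipping the cooldown if possible.

    Sketch only: assumes ``last_close_time`` is stored in the AuthInfo metadata
    every time a transport is closed.
    """
    safe_interval = authinfo.get_transport().get_safe_open_interval()
    last_close_time = authinfo.get_metadata().get('last_close_time', 0.0)

    if time.time() - last_close_time >= safe_interval:
        # The cooldown has already elapsed since the last connection was closed,
        # so the transport can be opened immediately (or after 1 s, to be safe).
        delay = 0
    else:
        delay = safe_interval

    return loop.call_later(delay, do_open, context=contextvars.Context())
```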

In addition, the waiting times for the submit and retrieve tasks could also be reduced. Currently, the `safe_interval` seems to be imposed on all of them, even if they finish very quickly (I assume because each opens a transport connection via SSH). So we were wondering whether this could be made a bit more sophisticated, e.g. by adding special transport requests that can reuse an already open transport, and by keeping a transport whose task has finished open for a short while longer (also quickly discussed with @mbercx).

Of course, one would still need to make sure SSH does not get overloaded, that the implementation works under heavy load (not just for individual test calculations), and one would also have to consider how this all interacts with multiple daemon workers. Again with @giovannipizzi, I had a quick look, but it seems the implementation would be somewhat more involved. So I am wondering what the others think, and whether this is feasible and worth investing more time into. Pinging @khsrali, who has looked a bit more into transports.
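To illustrate the "keep the transport open a bit longer" part, here is a toy sketch of the pattern in plain asyncio (not actual AiiDA code; names such as `linger` are made up):

```python
import asyncio
import contextlib


class LingeringConnection:
    """Toy model: keep a connection open for ``linger`` seconds after its last user releases it."""

    def __init__(self, open_coro, close_coro, linger=5.0):
        self._open_coro = open_coro
        self._close_coro = close_coro
        self._linger = linger
        self._users = 0
        self._is_open = False
        self._close_task = None

    @contextlib.asynccontextmanager
    async def request(self):
        # A new request arrived before the linger expired: cancel the pending close and reuse.
        if self._close_task is not None:
            self._close_task.cancel()
            self._close_task = None
        if not self._is_open:
            await self._open_coro()  # here a safe interval would still apply
            self._is_open = True
        self._users += 1
        try:
            yield
        finally:
            self._users -= 1
            if self._users == 0:
                # Do not close right away: a follow-up task (e.g. submit after upload) may reuse it.
                self._close_task = asyncio.ensure_future(self._close_later())

    async def _close_later(self):
        await asyncio.sleep(self._linger)
        await self._close_coro()
        self._is_open = False
        self._close_task = None
```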

giovannipizzi commented 2 months ago

Thanks for the nice write-up, @GeigerJ2! Just some minor additional comments/clarifications.

khsrali commented 1 week ago

Regarding this, I ran a test to see how much this issue impacts our performance.

I configured a computer with `core.ssh` as the transport plugin and `core.direct` as the scheduler (SSH to my own machine). I then submitted 225 simple jobs. Each job does an arithmetic add and creates 4 files, each 1KB in size. Finally, I recorded the frequency of calls to a number of important functions and methods in `core.ssh` and `BashCliScheduler`.
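(For context, the submission side of such a test could look roughly like the following; the code label is a placeholder, and I am assuming something like the `core.arithmetic.add` calculation plugin rather than the exact setup used:)

```python
from aiida import load_profile, orm
from aiida.engine import submit

load_profile()

# Placeholder label: a code configured on the SSH computer, e.g. bash wrapped by core.arithmetic.add.
code = orm.load_code('add@my-ssh-computer')

for i in range(225):
    builder = code.get_builder()
    builder.x = orm.Int(i)
    builder.y = orm.Int(i + 1)
    submit(builder)
```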

First, `safe_interval` was set to 30 seconds. In this case, it took 7 minutes to execute everything, and `transport.open()` was called 8 times.

[histogram: histogram_ssh_225_30_seconds__safe_intervals]

Then, I set `safe_interval` to 0.1 seconds. Now, it took 4 minutes and 28 seconds, and `transport.open()` was called 7 times.

[histogram: histogram_ssh_225_0 5_seconds__safe_intervals]

Since `transport.open()` was called only a few times, I think we may not need to change the design further. There might be room for some improvement, but the benefit relative to the effort seems minor.

Note 1: In scenarios with longer jobs, this volume of requests tends to be spread out over time, so `transport.open()` may well be called more often. But I don't believe that would change the conclusion of this comment.

Note 2: Something worth investigating is the two long gaps in the second plot, which theoretically should not be there! In any case, that is certainly not related to `safe_interval`.