async API - Githubissues

douglas-raillard-arm commented 3 years ago

Add an async API to devlib to take advantage of concurrency more easily, both at a coarse and fine grain.

Note: This currently requires Python 3.7 or later.

Based on https://github.com/ARM-software/devlib/pull/568

douglas-raillard-arm commented 3 years ago

Preliminary results on:

        with target.cpufreq.use_governor('performance'):
            pass

1.4x faster on an SSH target (Juno r0):

on this branch: 3.7s (avg over 10 runs)
on master: 5.3s (avg over 10 runs)

2.2x faster on a local target:

on this branch: ~9s
on master: ~20s

douglas-raillard-arm commented 3 years ago

I checked all instances of threading.Lock in devlib and it should be safe, as none is in the path used when running a background command or any other coroutine.

If a function taking a threading.Lock was being called by 2 coroutines that end up running concurrently, the 2nd coroutine would wait for the lock, thereby blocking the main thread and deadlocking.
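A minimal sketch of that failure mode, using only the standard library. The two coroutines and the timeout are illustrative assumptions (the timeout exists only so the demo terminates instead of deadlocking); none of this is devlib code.

```python
import asyncio
import threading

lock = threading.Lock()


async def holder(events):
    # Takes the threading.Lock and then yields to the event loop
    # while still holding it.
    with lock:
        events.append('holder acquired')
        await asyncio.sleep(0.05)
    events.append('holder released')


async def contender(events):
    # A plain lock.acquire() here would block the only thread forever,
    # since holder() can never resume to release the lock.
    # The timeout is only there to let the demo finish.
    got = lock.acquire(timeout=0.2)
    events.append(f'contender acquired: {got}')
    if got:
        lock.release()


async def main():
    events = []
    await asyncio.gather(holder(events), contender(events))
    return events


events = asyncio.run(main())
print(events)
```

The contender never gets the lock: while it waits, the event loop is frozen, so the holder cannot run and release it.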

douglas-raillard-arm commented 3 years ago

I also tried various combinations of thread pools, including running a blocking execute() in a different thread (so with separate SSH connections) and I did not get any real timing variations.

What I did get was some errors, probably because I was maintaining too many connections to the SSH server, so the current approach with one connection and many concurrent SSH channels seems to be the best. I could not saturate the server with open channels (maybe there is no limit). If that were to happen, we could use an asyncio.Semaphore to limit the number of bg commands in flight.
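A sketch of the asyncio.Semaphore idea, assuming a hypothetical run_limited() helper and a fake command that only sleeps. It shows the limiting technique, not how devlib would wire it in.

```python
import asyncio


async def run_limited(coros, limit):
    # Hypothetical helper: run all coroutines concurrently, but allow
    # at most `limit` of them past the semaphore at any time.
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_command(i, state):
    # Stand-in for a background command; tracks peak concurrency.
    state['active'] += 1
    state['peak'] = max(state['peak'], state['active'])
    await asyncio.sleep(0.01)
    state['active'] -= 1
    return i


state = {'active': 0, 'peak': 0}
results = asyncio.run(
    run_limited([fake_command(i, state) for i in range(10)], limit=3)
)
print(results, state['peak'])
```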

douglas-raillard-arm commented 3 years ago

There are a few things to clean up (docstrings etc.), so if you are happy with the API as it stands I will proceed with:

adding docstrings
renaming asyn.parallel to asyn.concurrently
removing the .concurrent attribute, as I think it is not that useful in practice. It allows running the function concurrently, only blocking to get the result when it is awaited, whereas the "normal" coroutine behavior is to not do anything until the return value is awaited. asyn.concurrently is usually better as it makes it obvious what runs concurrently and avoids any kind of "task leak".
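The difference between the "normal" coroutine behavior and explicit concurrency can be illustrated with plain asyncio. Here asyncio.gather() is used as a stand-in for asyn.concurrently, which is an assumption about its semantics rather than a description of its implementation.

```python
import asyncio

log = []


async def step(name):
    log.append(f'start {name}')
    await asyncio.sleep(0)
    log.append(f'end {name}')
    return name


async def main():
    # Creating the coroutine object does not run any of its code.
    coro = step('lazy')
    log.append('created')
    await coro

    # Explicit concurrency: both steps interleave, and both are
    # finished by the time gather() returns, so nothing "leaks".
    return await asyncio.gather(step('a'), step('b'))


results = asyncio.run(main())
print(log)
```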

EDIT: cleanup done. Beyond possible high level doc and converting more bits, the change is somewhat ready.

douglas-raillard-arm commented 3 years ago

While working on a bug fix for another problem, it came to my attention that SSH servers accept a fixed number of "sessions" per connection. These "sessions" seem to map to paramiko's channels. This means that for a given connection, we might be limited to e.g. 10 concurrent channels (the default of OpenSSH). This is configurable with MaxSessions on the OpenSSH side and results in an exception on the paramiko side: Could not open an SSH channel: ChannelException(2, 'Connect failed')

I'll probably need to handle that in that PR to limit the number of opened channels. There does not seem to be a standard way of getting the max number of channels, but we can probably handle the exception as a hint, or attempt to read OpenSSH config and parse it to find the MaxSessions value (a bit hacky though).
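A rough sketch of the "parse the OpenSSH config" option. parse_max_sessions() is a hypothetical helper, and the parsing is deliberately simplified: it takes the first MaxSessions value found and gives up at the first Match block, since conditional settings are not handled. The default of 10 is OpenSSH's documented default.

```python
def parse_max_sessions(config_text, default=10):
    """Return the first global MaxSessions value in sshd_config text."""
    for line in config_text.splitlines():
        # Drop comments and surrounding whitespace (simplified handling).
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        parts = line.replace('=', ' ').split()
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == 'match':
            # Settings inside Match blocks are conditional: not handled.
            break
        if keyword == 'maxsessions' and len(parts) > 1:
            try:
                return int(parts[1])
            except ValueError:
                continue
    return default


sample = "# server config\nPort 22\nmaxsessions 4\nMaxSessions 20\n"
print(parse_max_sessions(sample))
```

Reading the file would still require access to the server-side config, which is part of why this route is hacky.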

douglas-raillard-arm commented 2 years ago

@setrofim @marcbonnici I've updated the PR with automatic detection of the number of allowed background commands, along with some automatic limiting in the async API to use at most half of the available channels.
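One way such a detection could work is to probe until the server refuses, treating the exception as the hint. This is only a sketch under that assumption: detect_max_channels(), async_budget() and the fake server are hypothetical names, and the PR's actual detection may differ.

```python
def detect_max_channels(open_channel, probe_limit=64):
    # Open channels until the server refuses, then close them all.
    opened = []
    try:
        for _ in range(probe_limit):
            try:
                opened.append(open_channel())
            except Exception:
                break
    finally:
        for channel in opened:
            channel.close()
    return len(opened)


def async_budget(max_channels):
    # Use at most half of the available channels, but at least one.
    return max(1, max_channels // 2)


class FakeServer:
    """Stand-in for an SSH server with a fixed session limit."""

    def __init__(self, limit):
        self.limit = limit
        self.open_count = 0

    def open_channel(self):
        if self.open_count >= self.limit:
            raise RuntimeError('Connect failed')
        self.open_count += 1
        return self

    def close(self):
        self.open_count -= 1


server = FakeServer(limit=10)
detected = detect_max_channels(server.open_channel)
print(detected, async_budget(detected))
```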

Next step:

Decide what version of Python we want to support
"instrument" read_value() and write_value() such that we can detect when we are trying to read/write to the same files concurrently. This should catch most issues where we are trying to do some setup concurrently, which is where the bulk of the speed up will come from.

douglas-raillard-arm commented 2 years ago

Updated the PR with checks to make sure that no file access conflicts when executing concurrent asyncio tasks. An artificial example:

from devlib.utils.asyn import *

m = ConcurrentResourceManager()

r1 = FileResource('host', 'file1', 'w')
r2 = FileResource('host', 'file1', 'r')

async def f1():
    # m.track_resource(r1)
    m.track_resource(r2)
    print('from f1')

async def f2():
    m.track_resource(r1)
    m.track_resource(r2)
    print('from f2')

async def f3():
    # m.track_resource(r1)
    print('from f3')

run(
    m.concurrently(
        (
            f1(),
            m.concurrently(
                (
                    f2(),
                    f3(),
                )
            )
        )
    )
)

In the PR, ConcurrentResourceManager is target.resource_manager, which is thread-local.

Note: this depends on Python >= 3.7 for asyncio.current_task()
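The conflict rule behind the example above can be sketched independently of devlib: two accesses to the same file conflict when they come from different concurrent branches and at least one of them is a write. FileAccess and find_conflicts() are hypothetical names for illustration, not the PR's classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileAccess:
    host: str
    path: str
    mode: str  # 'r' or 'w'


def conflicts(a, b):
    # Same file on the same host, and at least one side writes.
    return a.host == b.host and a.path == b.path and 'w' in (a.mode, b.mode)


def find_conflicts(branches):
    """branches maps a branch name to the accesses it makes."""
    found = []
    names = list(branches)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            for a in branches[first]:
                for b in branches[second]:
                    if conflicts(a, b):
                        found.append((first, second, a.path))
    return found


r1 = FileAccess('host', 'file1', 'w')
r2 = FileAccess('host', 'file1', 'r')

# Same layout as the artificial example: f1 reads, f2 writes and reads,
# f3 touches nothing.
found = find_conflicts({'f1': [r2], 'f2': [r1, r2], 'f3': []})
print(found)
```

Two concurrent reads of the same file are allowed, so only the f1 read against the f2 write is reported.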

douglas-raillard-arm commented 2 years ago

PR updated with:

Forcing Python 3.7 in setup.py
Renamed ConcurrentResourceManager to AsyncManager
Almost all methods of Target have been converted. Not all of them have been "smartly" converted though, e.g. by doing things concurrently.
Some more modules/collectors updated to make use of concurrent commands (cgroups module init)

douglas-raillard-arm commented 2 years ago

@marcbonnici @setrofim Updated with a bunch of fixes that were needed to run on my test platform:

Add Target.execute(..., decode=True) parameter. When decode=False is used, execute() returns a tuple of bytes (stdout, stderr). This was necessary to successfully read sysfs content, which can contain binary data (from device tree). Note that the change to the Android connection class is entirely untested, so running a simple Target.execute('echo hello; echo world >&2 ', decode=False) on an Android device would be a useful test.
Cleanup the cgroup module. Trying to parallelize the setup shed some light on a few problems with the code structure. This PR migrates it to use cgroup v2, cleans some cruft and exposes more clearly what pieces will be retainable in a future cgroup API that allows full cgroup power and what will probably become deprecated.

douglas-raillard-arm commented 2 years ago

I guess I could split the PR so we can merge the main bit, and then open a new one with the relatively unrelated fixes (cgroup v2 and read_tree_values() on binary content)

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated with:

fixes to cgroups:

only manipulate PIDs since thread support in cgroup v2 is significantly more complex and would require a dedicated API. This is probably best left for a future API
Fix to Controller.tasks('/'). Since existing code apparently expects to list all the tasks in the root group (rather than the root of the subtree handled by the controller), handle it as a special case.

Fix to Target.install(): Add a has_busybox=True parameter to Connection.execute(), so that install() is able to install busybox without needing to run it (for Android). This is the WIP commit which will be squashed if it works well.

douglas-raillard-arm commented 2 years ago

@marcbonnici @setrofim Updated the PR:

turned Target.execute(decode=False) into a separate method Target.execute_raw()
Implemented Target.execute_raw() on top of BackgroundCommand.communicate() to avoid invasive changes to Connection.execute().
removed cgroups modification, as they would be better served by a new module
Added busybox --install -s and the corresponding simplification in the code
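As a local analogy for building a raw execute on top of communicate(), the standard library's subprocess.Popen.communicate() also returns undecoded (stdout, stderr) bytes. The sketch below uses only subprocess, not devlib's BackgroundCommand, and shows why binary output cannot simply be decoded.

```python
import subprocess
import sys

# Child process that writes bytes which are not valid UTF-8, similar in
# spirit to binary sysfs content.
child_code = (
    "import sys; "
    "sys.stdout.buffer.write(b'\\x00\\x01\\xff'); "
    "sys.stderr.buffer.write(b'warn')"
)

proc = subprocess.Popen(
    [sys.executable, '-c', child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# communicate() returns both streams as raw bytes.
stdout, stderr = proc.communicate()

try:
    stdout.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(stdout, stderr, decodable)
```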

douglas-raillard-arm commented 2 years ago

I've updated the PR with:

fixes for missing () and await
A (very unlikely) possible fix for the bg command benchmarking hang:
- '{} xargs kill -9'.format(self.busybox),
+ '{} xargs kill -9 --'.format(self.busybox),

douglas-raillard-arm commented 2 years ago

If that does not work, I'll try another strategy without any killer_bg, but instead blocking on some stdin for each command (rather than a plain sleep). This way I should be able to unblock them easily.
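A local sketch of the "block on stdin" strategy, using subprocess rather than a devlib target (an assumption made to keep the example self-contained): the child blocks reading its stdin, and closing the pipe is enough to release it, with no kill involved.

```python
import subprocess
import sys

# The child blocks until its stdin reaches end-of-file.
proc = subprocess.Popen(
    [sys.executable, '-c', 'import sys; sys.stdin.read()'],
    stdin=subprocess.PIPE,
)

still_running = proc.poll() is None

# Closing our end of the pipe delivers EOF and unblocks the child.
proc.stdin.close()
exit_code = proc.wait(timeout=10)

print(still_running, exit_code)
```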

douglas-raillard-arm commented 2 years ago

Updated PR. Added:

Added -- to devlib.connection._kill_pgid_cmd to avoid issues as well

Still to do:

split the addition of busybox to PATH in a separate PR (and fix the issues with ps output)
figure out why bg.__exit__ is hanging on your setup

marcbonnici commented 2 years ago

Finally found the issue on my Android device: it was due to the PID reporting of adb connections. Occasionally it would report the PID of the command used to find the PID rather than the PID of the command itself, due to PID reuse. I've submitted a PR to fix this so I can continue testing on my Android devices.

douglas-raillard-arm commented 2 years ago

Very good news. I'll try to work on that PR shortly to address the remaining items, but I guess we don't have any big blockers anymore, so it shouldn't be a lot of work.

douglas-raillard-arm commented 2 years ago

PR updated:

Changes:

Added max_bg connection parameter, so that users can force serialized execute() and also limit the auto-detection duration.
split the PR for busybox symlinks so it's not blocking this one: https://github.com/ARM-software/devlib/pull/585

TODO:

workaround/fix the android background command PID detection problem

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated again with:

Fix for the Android background command PID detection problem. This is somewhat redundant with https://github.com/ARM-software/devlib/pull/581 but tries another approach. Instead of printing the PID to stdout, it writes it to a temp file. This avoids the issues of stream redirection, which are acceptable for SSH since it needs to be done anyway, but would add a great deal of complexity for adb.

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated with a 2nd attempt to close the adb background PID race: the command is frozen and then thawed using SIGSTOP/SIGCONT. It's not pretty but is a relatively minor change on top of the existing code.
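The freeze/thaw handshake can be sketched locally on a POSIX host. This is an analogy only: a local Popen already knows the PID, whereas over adb it has to be looked up with ps while the command is frozen. spawn_frozen() and thaw() are hypothetical helpers, and the sketch assumes sh is available.

```python
import os
import signal
import subprocess


def spawn_frozen(argv):
    # The shell stops itself, then exec's the real command once resumed,
    # so the command keeps the PID of the frozen shell.
    script = 'kill -STOP $$; exec "$@"'
    proc = subprocess.Popen(['sh', '-c', script, 'sh', *argv])
    # Block until the child reports that it has stopped.
    _, status = os.waitpid(proc.pid, os.WUNTRACED)
    return proc, os.WIFSTOPPED(status)


def thaw(proc):
    # Release the frozen command.
    os.kill(proc.pid, signal.SIGCONT)


proc, was_stopped = spawn_frozen(['true'])
# While frozen, the command cannot exit, so its PID can be looked up
# without racing against the end of the process.
thaw(proc)
exit_code = proc.wait(timeout=10)
print(was_stopped, exit_code)
```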

marcbonnici commented 2 years ago

> @marcbonnici PR updated with a 2nd attempt to close the adb background PID race: the command is frozen and then thawed using SIGSTOP/SIGCONT. It's not pretty but is a relatively minor change on top of the existing code.

That's an interesting approach and perhaps the way to go. Initially it seems to work well; however, occasionally something goes wrong when attempting to send the signals on my device.

while True:
    bg = ta.background("ls")
    print(bg.pid)

After a varying number of iterations it raises the following error:

    625 find_pid = f'''pid=$({conn.busybox} ps -A -o pid,args | {conn.busybox} {grep_cmd} | {conn.busybox} grep -v {quote(grep_cmd)} | {conn.busybox} awk '{{print $1}}') && kill -CONT "$pid" && printf "%s" "$pid"'''
    626 ps_out = conn.execute(find_pid)
--> 627 pid = int(ps_out)
    628 return (p, pid)

ValueError: invalid literal for int() with base 10: '/system/bin/sh: kill: : arguments must be jobs or process IDs\n'

So it looks like, for some reason, it's failing to find the correct PID. Have you seen anything like this in your testing?

douglas-raillard-arm commented 2 years ago

PR updated with speedup of the async path when non-blocking calls are actually not required (when just one asyncio task is in use).

With regard to the ps output, I think we need to print the command and its output (without the grep filtering) to understand what is going on. Maybe kill simply rejects PIDs that are no longer alive (and therefore not really a PID)? I can't think of how that could happen though, since the shell process is frozen, so it should not finish until released.
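A small sketch of that kind of diagnostic: instead of a bare int() on the output, a hypothetical parse_pid() helper reports both the command and the raw output when no PID can be parsed. The names are illustrative and not part of devlib.

```python
class PidDetectionError(RuntimeError):
    """Raised when the PID lookup output is not a single integer."""


def parse_pid(command, output):
    try:
        return int(output.strip())
    except ValueError as e:
        # Keep the command and the raw output in the error message so
        # the failing stage of the pipeline can be identified.
        raise PidDetectionError(
            f'could not detect PID\ncommand: {command}\noutput: {output!r}'
        ) from e


print(parse_pid('ps | grep ...', '1234\n'))

try:
    parse_pid('ps | grep ...', '')
except PidDetectionError as e:
    message = str(e)
    print(message)
```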

douglas-raillard-arm commented 2 years ago

I've tried a few thousand iterations of your test case and it worked without issues on my side.

Also updated the PR to use busybox kill instead of the system's one; maybe that will help?

douglas-raillard-arm commented 2 years ago

Re-updated the PR with kill and printf swapped in PID detection. That will allow seeing the value of $pid even if kill fails for some reason.

marcbonnici commented 2 years ago

Thanks for the update, running the same test shows that no PID is being detected

    624 # Find the PID and release the blocked background command with SIGCONT
    625 find_pid = f'''pid=$({busybox} ps -A -o pid,args | {busybox} {grep_cmd} | {busybox} grep -v {quote(grep_cmd)} | {busybox} awk '{{print $1}}') && {busybox} printf "%s" "$pid" && {busybox} kill -CONT "$pid"'''
--> 626 pid = int(conn.execute(find_pid))
    627 return (p, pid)

ValueError: invalid literal for int() with base 10: ''

It's interesting that you don't see the same problem. I'm seeing it on my Nexus 5 over an ADB connection. I also tried it on an emulator, where it took a lot longer (I left the above while loop running for a few minutes), but it did eventually hit the same error.

As you said I'll need to try printing the separate parts of the command to see where it is failing. I can only think that the command is exiting before the kill -STOP has a chance to execute for some reason but I would need to investigate further.

douglas-raillard-arm commented 2 years ago

PR updated with the -- removed from the killer_bg command, since it was causing issues for busybox kill. It also now uses busybox kill rather than the system's kill so that all platforms behave the same.

douglas-raillard-arm commented 2 years ago

Updated the PR with:

some squashing of commits that were touching the same part of code
A fix to Target.get_connection() that was not setting conn.busybox on new connections
2 new commits on top: Change the approach for non blocking execution

Previously, Target._execute_async() was based on Target.background(). This allowed sharing a single connection instance, but that also meant that it blocked until the command was spawned. Unfortunately, experiments have shown that spawning a background command is quite expensive (e.g. on Android, where we end up with a blocking ps). Following Amdahl's law and experiments, that greatly reduced the speedup that was achievable (e.g. 2x speedup for 8 concurrent commands on one of my Juno SSH setups).
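As a back-of-the-envelope illustration, and assuming the observed numbers fit Amdahl's model exactly (which they only approximately would), a 2x speedup with 8 concurrent commands corresponds to roughly 57% of the time being parallelizable:

```python
def amdahl_speedup(parallel_fraction, workers):
    # Amdahl's law: S = 1 / ((1 - p) + p / n)
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def parallel_fraction_for(speedup, workers):
    # Inverse of the formula above: p = (1 - 1/S) / (1 - 1/n)
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


p = parallel_fraction_for(2.0, 8)   # 0.5 / 0.875 = 4/7
limit = 1.0 / (1.0 - p)             # best case with unlimited workers
print(round(p, 3), round(limit, 2))
```

Under that assumption, about 43% of each command stays serialized (the blocking spawn), and the speedup could never exceed roughly 2.33x regardless of how many commands run concurrently.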

Fortunately, this is solvable: instead of relying on Target.background(), we can just create a thread pool with a separate connection instance each, and dispatch calls to these threads without blocking at all. This allows the best level of parallelism possible, and even avoids the inefficiencies of the Target.background() API (e.g. the need to know the PID even if it's not actually used). That also somewhat simplifies the implementation and separates it from any background command implementation bug.
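The general shape of that approach can be sketched with the standard library: a thread pool where each worker thread lazily creates its own connection, and coroutines dispatch blocking calls to it. FakeConnection and the helper names are stand-ins chosen for illustration, not the PR's implementation.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeConnection:
    """Stand-in for a per-thread connection with a blocking execute()."""

    def execute(self, cmd):
        time.sleep(0.01)
        return f'ran {cmd}'


_local = threading.local()


def _get_connection():
    # One connection per worker thread, created on first use.
    conn = getattr(_local, 'conn', None)
    if conn is None:
        conn = _local.conn = FakeConnection()
    return conn


def _blocking_execute(cmd):
    return _get_connection().execute(cmd)


async def execute_async(pool, cmd):
    # Hand the blocking call to the pool without blocking the event loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, _blocking_execute, cmd)


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            *(execute_async(pool, f'cmd{i}') for i in range(8))
        )


outputs = asyncio.run(main())
print(outputs)
```

Nothing here needs a PID or a background command, which is the simplification the comment describes.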

Using this implementation, I was able to achieve a 15s runtime reduction on a LISA kernel test that takes 1min30 with devlib's master branch (it consists of varied uses of the API, lots of sysfs reading and writing, one background() execution for a test workload and some file pushing/pulling).

douglas-raillard-arm commented 2 years ago

Updated PR:

Split commits that are not required for this PR anymore into other PRs (fixes related to background())
Squashed commits that are modifying Target._execute_asyn()

douglas-raillard-arm commented 2 years ago

FWIW we have been using this PR in Lisa for a month now (Lisa repo vendorizes devlib as a git subtree) and so far I haven't observed or heard of any issue related to that

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated with:

Fix to connect(max_async)
Fix to cpufreq.use_governor() racy file write
Fix to asyn.asyncf() decorator for async generators: async generators are now consumed completely when crossing a blocking boundary, rather than failing to await on them. We could maybe do something a bit smarter to preserve laziness provided that we can asyncio.run() anywhere, so I'll have a look later.
Added asyn.memoized_method() decorator that works for both async and non-async code. This does not memoize non-hashable data so it will avoid issues described here: https://github.com/ARM-software/devlib/issues/341
Converted the final bits of the cpufreq module to async
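A sketch of a memoizing decorator that works for both async and non-async functions. It assumes one possible reading of "does not memoize non-hashable data", namely that calls with unhashable arguments bypass the cache; the name memoized() is hypothetical and this is not the asyn.memoized_method() implementation.

```python
import asyncio
import functools
import inspect


def memoized(f):
    cache = {}

    def make_key(args, kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        try:
            hash(key)
        except TypeError:
            return None  # unhashable arguments: do not memoize
        return key

    if inspect.iscoroutinefunction(f):
        @functools.wraps(f)
        async def wrapper(*args, **kwargs):
            key = make_key(args, kwargs)
            if key is None:
                return await f(*args, **kwargs)
            if key not in cache:
                # Cache the awaited result, not the coroutine object.
                cache[key] = await f(*args, **kwargs)
            return cache[key]
    else:
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            key = make_key(args, kwargs)
            if key is None:
                return f(*args, **kwargs)
            if key not in cache:
                cache[key] = f(*args, **kwargs)
            return cache[key]

    return wrapper


calls = []


@memoized
async def read_async(path):
    calls.append(('async', path))
    return f'content of {path}'


@memoized
def count_sync(paths):
    calls.append(('sync', paths))
    return len(paths)


async def main():
    return [await read_async('/a'), await read_async('/a')]


async_results = asyncio.run(main())
sync_results = [count_sync(('x', 'y')), count_sync(('x', 'y')),
                count_sync(['x']), count_sync(['x'])]
print(len(calls))
```

Used on methods, a cache like this would also keep self alive, which a real implementation would need to address.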

EDIT: Forgot to add the last bullet entry

douglas-raillard-arm commented 2 years ago

PR re-updated with a lazy async generator blocking shim. This preserves the laziness of the async generator, which is made possible by the nest_asyncio package.
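The idea of a lazy blocking shim can be sketched with plain asyncio: each item is pulled from the async generator only when the blocking iterator is advanced. iter_blocking() is a hypothetical helper, and this sketch only works when no event loop is already running in the calling thread; re-entering a running loop is what nest_asyncio is needed for.

```python
import asyncio

_DONE = object()


async def _next_or_done(agen):
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return _DONE


def iter_blocking(agen):
    # Drive the async generator one item at a time from blocking code.
    loop = asyncio.new_event_loop()
    try:
        while True:
            item = loop.run_until_complete(_next_or_done(agen))
            if item is _DONE:
                break
            yield item
    finally:
        loop.run_until_complete(agen.aclose())
        loop.close()


async def numbers(n, produced):
    for i in range(n):
        await asyncio.sleep(0)
        produced.append(i)
        yield i


produced = []
it = iter_blocking(numbers(3, produced))
first = next(it)
produced_after_first = list(produced)  # only one item computed so far
rest = list(it)
print(first, produced_after_first, rest)
```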

ARM-software / devlib

async API #569