ARM-software / devlib

Library for interaction with and instrumentation of remote devices.
Apache License 2.0
47 stars 78 forks source link

async API #569

Closed douglas-raillard-arm closed 2 years ago

douglas-raillard-arm commented 3 years ago

Add an async API to devlib to take advantage of concurrency more easily, both at a coarse and fine grain.

Note: This currently requires python 3.7

Based on https://github.com/ARM-software/devlib/pull/568

douglas-raillard-arm commented 3 years ago

Preliminary results on:

        with target.cpufreq.use_governor('performance'):
            pass

1.4x faster on a SSH target (juno r0):

2.2x faster on a local target:

douglas-raillard-arm commented 3 years ago

I checked all instances of threading.Lock in devlib and it should be safe, as none is in the path used when running a background command or any other coroutine.

If a function taking a threading.Lock was being called by 2 coroutines that end up running concurrently, the 2nd coroutine would wait for the lock, thereby blocking the main thread and deadlocking.

douglas-raillard-arm commented 3 years ago

I also tried various combinations of thread pools, including running a blocking execute() in a different thread (so with separate SSH connections) and I did not get any real timing variations.

What I did get is some errors probably because I was maintaining too many connections on the SSH server, so the current approach with one connection and many concurrent SSH channels seems to be the best. I could not saturate the server with open channels (maybe there is no limit). If that was to happen, we could use an asyncio.Semaphore to limit the number of bg commands in flight.

douglas-raillard-arm commented 3 years ago

There are a few things to cleanup (docstring etc) so if you are happy with the API as it stands I will proceed with:

EDIT: cleanup done. Beyond possible high level doc and converting more bits, the change is somewhat ready.

douglas-raillard-arm commented 3 years ago

While working on a bug fix for another problem, it came to my attention that SSH servers accept a fixed number of "sessions" per connection. This "sessions" seem to map to paramiko's channels. This means that for a given connection, we might be limited to e.g. 10 concurrent channels (default of OpenSSH). This is configurable with MaxSessions on OpenSSH side and results in an exception on paramiko side: Could not open an SSH channel: ChannelException(2, 'Connect failed')

I'll probably need to handle that in that PR to limit the number of opened channels. There does not seem to be a standard way of getting the max number of channels, but we can probably handle the exception as a hint, or attempt to read OpenSSH config and parse it to find the MaxSessions value (a bit hacky though).

douglas-raillard-arm commented 2 years ago

@setrofim @marcbonnici I've updated the PR with an automatic detection of number of allowed background commands, along with some automatic limiting in the async API to use at most half of the available channels.

Next step:

douglas-raillard-arm commented 2 years ago

Updated the PR with checks to make sure that no file access conflicts when executing concurrent asyncio tasks. An artificial example:

from devlib.utils.asyn import *

m = ConcurrentResourceManager()

r1 = FileResource('host', 'file1', 'w')
r2 = FileResource('host', 'file1', 'r')

async def f1():
    # m.track_resource(r1)
    m.track_resource(r2)
    print('from f1')

async def f2():
    m.track_resource(r1)
    m.track_resource(r2)
    print('from f2')

async def f3():
    # m.track_resource(r1)
    print('from f3')

run(
    m.concurrently(
        (
            f1(),
            m.concurrently(
                (
                    f2(),
                    f3(),
                )
            )
        )
    )
)

In the PR, ConcurrentResourceManager is target.resource_manager, which is thread-local

Note: this depends on Python >= 3.7 for asyncio.current_task()

douglas-raillard-arm commented 2 years ago

PR updated with:

douglas-raillard-arm commented 2 years ago

@marcbonnici @setrofim Updated with a bunch of fixes that were needed to run on my test platform:

douglas-raillard-arm commented 2 years ago

I guess I could split the PR so we can merge the main bit, and then open a new one with the relatively unrelated fixes (cgroup v2 and read_tree_values() on binary content)

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated with:

fixes to cgroups:

Fix to Target.install(): Add a has_busybox=True parameter to Connection.execute(), so that install() is able to install busybox without needing to run it (for Android). This is the WIP commit which will be squashed if it works well.

douglas-raillard-arm commented 2 years ago

@marcbonnici @setrofim Updated the PR:

douglas-raillard-arm commented 2 years ago

I've updated the PR with:

douglas-raillard-arm commented 2 years ago

If that does not work, I'll try another strategy without any killer_bg, but instead blocking on some stdin for each command (rather than a plain sleep). This way I should be able to unblock them easily.

douglas-raillard-arm commented 2 years ago

Updated PR. Added:

Still to do:

marcbonnici commented 2 years ago

Finally found the issue on my android device and it was due to the PID reporting of adb connections. Occasionally it would report the PID of the command used to find the PID rather than the PID of the command itself due to PID reuse. I've submitted a PR to fix this so can continue with on my android devices.

douglas-raillard-arm commented 2 years ago

Very good news, so I'll try to shortly work on that PR to address the remaining things but I guess we don't have big blockers anymore so it shouldn't be a lot of work

douglas-raillard-arm commented 2 years ago

PR updated:

Changes:

TODO:

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated again with:

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated with a 2nd attempt to close the adb background PID race: the command is frozen and the thawed using SIGSTOP/SIGCONT. It's not pretty but is a relatively minor change on top of the existing code.

marcbonnici commented 2 years ago

@marcbonnici PR updated with a 2nd attempt to close the adb background PID race: the command is frozen and the thawed using SIGSTOP/SIGCONT. It's not pretty but is a relatively minor change on top of the existing code.

That's an interesting approach perhaps the way to go. Initially it seems to work well however occasionally something goes wrong when attempting to send the signals on my device.

while True:
    bg = ta.background("ls")
    print(bg.pid)

After a varying number of iterations it raises the following error:

    625 find_pid = f'''pid=$({conn.busybox} ps -A -o pid,args | {conn.busybox} {grep_cmd} | {conn.busybox} grep -v {quote(grep_cmd)} | {conn.busybox} awk '{{print $1}}') && kill -CONT "$pid" && printf "%s" "$pid"'''
    626 ps_out = conn.execute(find_pid)
--> 627 pid = int(ps_out)
    628 return (p, pid)

ValueError: invalid literal for int() with base 10: '/system/bin/sh: kill: : arguments must be jobs or process IDs\n'

So look like for some reason it's failing to find the correct PID, have you seen anything like this in your testing?

douglas-raillard-arm commented 2 years ago

PR updated with speedup of the async path when non-blocking calls are actually not required (when just one asyncio task is in use).

Wrt to the ps output, I think we need to print the command and output (without the filtering of grep) to understand what is going on. Maybe kill simply rejects PIDs that are no longer alive (and therefore not really a PID) ? I can't think of how that could happen though, since the shell process is frozen so it should not finish until released.

douglas-raillard-arm commented 2 years ago

I've tried a few thousand iterations of your test case and it worked without issues on my side.

Also updated the PR to use busybox kill instead of the system's one, maybe that will help ?

douglas-raillard-arm commented 2 years ago

Re-updated the PR with kill and printf swapped in PID detection. That will allow seeing the value of $pid even if kill fails for some reason.

marcbonnici commented 2 years ago

Thanks for the update, running the same test shows that no PID is being detected

    624 # Find the PID and release the blocked background command with SIGCONT
    625 find_pid = f'''pid=$({busybox} ps -A -o pid,args | {busybox} {grep_cmd} | {busybox} grep -v {quote(grep_cmd)} | {busybox} awk '{{print $1}}') && {busybox} printf "%s" "$pid" && {busybox} kill -CONT "$pid"'''
--> 626 pid = int(conn.execute(find_pid))
    627 return (p, pid)

ValueError: invalid literal for int() with base 10: ''

That's interesting that you don't see the same problem, I'm seeing this problem on my Nexus 5 over an ADB connection, however I also tried it on a emulator which took a lot longer (left the above while loop running for a few minutes) but it did eventually hit the same error.

As you said I'll need to try printing the separate parts of the command to see where it is failing. I can only think that the command is exiting before the kill -STOP has a chance to execute for some reason but I would need to investigate further.

douglas-raillard-arm commented 2 years ago

PR update with the -- removed in the killer_bg command since it was causing issue for busybox kill. Also now use busybox kill rather than system's kill so that all platforms will behave the same.

douglas-raillard-arm commented 2 years ago

Updated the PR with:

Previously, Target._execute_async() was based on Target.background(). This allowed sharing a single connection instance, but that also meant that it blocked until the command was spawned. Unfortunately, experiments have show that spawning a background command is quite expensive (e.g. on android where we end up with a blocking ps). Following Amdahl's law and experiments, that greatly reduced the speedup that was achievable (e.g. x2 speedup for 8 concurrent commands for one of my Juno SSH setup).

Fortunately, this is solvable: instead of relying on Target.background(), we can just create a thread pool with a separate connection instance each, and dispatch calls to these threads without blocking at all. This allows the best level of parallelism possible, and even avoids the inefficiencies of the Target.background() API (e.g. the need to know the PID even if it's not actually used). That also somewhat simplifies the implementation and separates it from any background command implementation bug.

Using this implementation, I was able to achieve a 15s runtime reduction on a LISA kernel test that takes 1min30 with devlib's master branch (it consists of varied uses of the API, lots of sysfs reading and writing, one background() execution for a test workload and some file pushing/pulling).

douglas-raillard-arm commented 2 years ago

Updated PR:

douglas-raillard-arm commented 2 years ago

FWIW we have been using this PR in Lisa for a month now (Lisa repo vendorizes devlib as a git subtree) and so far I haven't observed or heard of any issue related to that

douglas-raillard-arm commented 2 years ago

@marcbonnici PR updated with:

EDIT: Forgot to add the last bullet entry

douglas-raillard-arm commented 2 years ago

PR re-updated with a lazy async generator blocking shim. This preserves the laziness of the async gen, made possible by nest_asyncio package.