Rahix / tbot

Automation/Testing tool for Embedded Linux Development
https://tbot.tools
GNU General Public License v3.0
91 stars 21 forks source link

Sandbox/Locking prototypes #45

Closed Lusus closed 3 years ago

Lusus commented 3 years ago

This pull request contains a couple of prototypes which may be handy to add to tbot_contrib. They most certainly commit crimes against humanity in their current form.

The first patch contains a U-Boot Sandbox board configuration which can be useful to get started with Tbot, without actually requiring real hardware.

The second patch contains 3 different machine locking implementations, and uses the Sandbox as testbed. The first implementation is in Python/bash. The second uses procmail's lockfile utility, which served as "good example" for the first implementation. The third implementation uses the dotlockfile utility.

In the Sandbox, a default locking pool size of 2 has been configured, which means two Sandbox instances can be started before an exception will be thrown. The timeout value specifies for how long these locks are valid, but is commented out by default. With short lock timeouts and/or interactive_uboot sessions, the locks my expire without the instance noticing, which causes problems. The requirements around this may need to be reconsidered.

Rahix commented 3 years ago

When you have time, please rebase on the current master to pull in the CI changes that have happened.

Lusus commented 3 years ago

Okay, I tried my best to apply all of your comments, and I rebased on master. Hopefully I didn't mess anything up too badly. ;-)

I reordered the implementations as follows:

The first is what I think/hope is the best of the bunch, it is in Python/shell, supports expiry, but also checks whether the "owner" is still around before deleting stale locks. The second is similar, but does not check the owner. This does just what the spec asks. The third uses lockfile The fourth uses dotlockfile

I kept the pooling in the contextmanager/_init_pre_connect() functions, because I think hooking it in another way may get tricky. These functions then call their corresponding _locker() functions (all with the same signature) which does the actual locking in the various ways.

Lusus commented 3 years ago

Okay, I guess I will leave my comments here then.

About the CI, yes, I see the CI is a bit stricter than Pylint... Will fix

About the high level architecture. The main issue why it is currently in the board machine, is because the board would actually need to translate the lock number, to an actual machine in the pool, where the assumption was, we have a pool of machines, represented by the board machine definition. Are we ever intending to use a pool of locks for dissimilar machines? The lock should not get teared down when the machine gets turned off, as we are using the _init_pre_connect() context, and the lock should remain for the life of the lab session instead. And I know there is work in progress for making lab sessions live longer. ;-) But I will definitely try to do it as you requested (if I can figure it out).

The only reason there are/were multiple implementations of the locking is to make sure I don't code myself into a corner, and to try to make issues obvious. I hope that I spotted most of the gotchas, so I will stick with only the first implementation from now on. :-)

Rahix commented 3 years ago

About the CI, yes, I see the CI is a bit stricter than Pylint... Will fix

Regarding the errors about lock_no and lock_file in particular: Those should be instance attributes (that means attributes on an object of the type inheriting this class) while you're defining class attributes here (attributes on the class directly). You probably want to not define them on the class and only add them dynamically to the instance later on, during initialization.

The main issue why it is currently in the board machine, is because the board would actually need to translate the lock number, to an actual machine in the pool, where the assumption was, we have a pool of machines, represented by the board machine definition. Are we ever intending to use a pool of locks for dissimilar machines?

Maybe I should clarify, the sketch I posted above does support this. The PooledMachineLock would be inherited by the board-machine and can make the name of the acquired lock available for later steps during instantiation:

class PooledMachineLock(machine.PreConnectInitializer):
    @contextlib.contextmanager
    def _init_pre_connect(self) -> typing.Iterator:
         with tbot.ctx.request(tbot.role.LabHost) as lh:
            for lock_name in candidates:
                if lh.request_machine_lock(lock_name):
                    break
            else:
                raise Exception("could not acquire the lock")

            # Save name of acquired lock:
            self.active_lock = lock_name

            yield None

And then later you could, for example, use this when connecting:

class MyBoard(connector.ConsoleConnector, PooledMachineLock, board.Board):
    def connect(self, mach):
        return mach.open_channel("picocom", f"/dev/tty{self.active_lock}")

The lock should not get teared down when the machine gets turned off, as we are using the _init_pre_connect() context, and the lock should remain for the life of the lab session instead.

_init_pre_connect is still only a part of the board machines lifecycle, even if it is the first thing to run and the last to exit (think of it like the outermost shell of an onion maybe). So when the board machine is deinitialized, the _init_pre_connect context will exit as well. The lab-host lifetime is not at play here.

The only reason there are/were multiple implementations of the locking is to make sure I don't code myself into a corner, and to try to make issues obvious. I hope that I spotted most of the gotchas, so I will stick with only the first implementation from now on. :-)

Ah, got it, that wasn't clear to me, sorry!

Lusus commented 3 years ago

Okay, so I rebuilt everything in the form that you suggested, and I managed to get everything by pre-commit and friends. Whether it is correct is a second matter. I had to go back to self.lock_no = -1 to make it happen though. I am certain that there are multiple style issues outstanding.

To make testing the Sandbox more feasible, I added a locallab.py machine config and an args file (in the first commit). To ease the testing of the locking using the Sandbox, I added the lock-sandbox.py and lock-locallab.py machine configs and an args file (in the second commit).

The idea is, that by selecting the corresponding args file, the uboot_build and interactive_uboot testcases can then build/run the Sandbox locally. By running interactive_uboot in multiple terminals the locking can also be demonstrated.

Lusus commented 3 years ago

Okay, so I tried my best to apply all your previous fixes, and then I built a prototype using flock as a master lock to serialise the checking and removal of stale locks as FlockManager.

Testing has not yet been as extensive as for LockManager, and there are still details outstanding like docstrings, etc

Lusus commented 3 years ago

Sorry, forgot to rebase my branch on master...

Rahix commented 3 years ago

Note: Once you're done please also mark this PR as "ready for review" as it is no longer only a draft :)

Lusus commented 3 years ago

I applied the changes as you had requested, and tried to apply the principles consistently everywhere. I also tried to add docstrings etc. everywhere. I removed the Sandbox components from this branch as you requested; they are still available for convenient testing at: https://github.com/Lusus/tbot/tree/lusus-sandbox