[CERTTF-412][CERTTF-413] refactor: treat attachment unpacking as a separate phase

boukeas commented 1 month ago

Description

Retrieving and unpacking the attachments of a Testflinger job is currently handled within the agent code, before the agent starts going through the phases of that job. This imposes certain limitations with regard to how attachment-related failures are handled:

There are no attachment-related start, success or failure events transmitted. In fact, attachment unpacking currently takes place even before the job_start event is emitted. If an error occurs during attachment unpacking, there is currently no way to convey that through events.
There is no attachment-related result included in the results dict associated with a job. If an error occurs during attachment unpacking, there is currently no way to convey that through the job results.
If an error occurs during attachment unpacking, it can be seen and investigated through the agent logs. However, there is no output to the user submitting the job (as would be required for resolving e.g. CERTTF-412) and, subsequently, none in the job results. Anyone polling a Testflinger job or monitoring the output of a Jenkins job would not be able to see any sort of error message like the one recorded in the agent logs.

One way to lift all these restrictions simultaneously is to treat the retrieval and unpacking of the job attachments as a separate phase.
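To illustrate the idea, here is a minimal sketch of a phase runner that reports an outcome through all three channels mentioned above (events, the results dict and user-visible output). It is not the agent code: `run_phase`, `unpack_broken_archive`, the plain lists standing in for the event emitter and the output log, and the `_fail` event suffix are all assumptions made for this example. Only the `<phase>_start`, `<phase>_success` and `<phase>_status` naming pattern comes from Testflinger itself.

```python
import io
import tarfile
from typing import Callable, Dict, List


def run_phase(
    name: str,
    action: Callable[[], None],
    results: Dict[str, int],
    events: List[str],
    output: List[str],
) -> int:
    """Run one phase and report its outcome through all three channels."""
    events.append(f"{name}_start")
    try:
        action()
    except Exception as error:
        # The error message reaches the user instead of only the agent log
        output.append(f"{name} failed: {error}")
        events.append(f"{name}_fail")  # hypothetical event name
        exit_code = 1
    else:
        events.append(f"{name}_success")
        exit_code = 0
    # The phase outcome is recorded in the job results
    results[f"{name}_status"] = exit_code
    return exit_code


def unpack_broken_archive() -> None:
    """Stand-in for attachment unpacking: the bytes are not a valid archive."""
    with tarfile.open(fileobj=io.BytesIO(b"not a tar archive")) as archive:
        archive.extractall("attachments")


results: Dict[str, int] = {}
events: List[str] = []
output: List[str] = []
run_phase("unpack", unpack_broken_archive, results, events, output)
```

With unpacking wrapped like any other phase, the failed archive shows up as an `unpack_status` entry in `results`, as a failure event in `events` and as an error line in `output`, with no attachment-specific special casing.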

Changelog

This PR:

Introduces a new TestflingerJobPhase abstract base class, with an interface that captures how all phases are supposed to work procedurally.
Introduces a new ExternalCommandPhase abstract base class, derived from TestflingerJobPhase, that captures the workings of phases that run a pre-configured external command. All previously existing phases fall into this category.
Refactors the code previously in TestflingerJob.run_test_phase into separate classes derived from ExternalCommandPhase, each corresponding to a different phase.
Introduces a new UnpackPhase derived from TestflingerJobPhase.
Fixes a minor issue in how tarfile is patched for Python 3.8 that allows directories to be included as attachments.
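The class hierarchy described above could look roughly like the sketch below. The class names `TestflingerJobPhase`, `ExternalCommandPhase` and `UnpackPhase` and the `post_core` hook are taken from this PR; everything else (the `go`, `core` and `run` methods, the `phase_id` attribute, the constructor arguments and the `ProvisionPhase` example) is an assumption made for illustration and may differ from the actual code in agent/testflinger_agent/job.py.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import List, Optional


class TestflingerJobPhase(ABC):
    """Procedural skeleton shared by all phases (method names are assumptions)."""

    phase_id: str = ""

    def __init__(self) -> None:
        self.result: Optional[int] = None

    def go(self) -> bool:
        """Return True if the phase should be executed at all."""
        return True

    @abstractmethod
    def core(self) -> int:
        """Do the actual work of the phase and return an exit code."""

    def post_core(self) -> None:
        """Optional hook that runs after `core`."""

    def run(self) -> Optional[int]:
        """Execute the phase; return None if it was skipped."""
        if not self.go():
            return None
        self.result = self.core()
        self.post_core()
        return self.result


class ExternalCommandPhase(TestflingerJobPhase):
    """A phase whose work is a pre-configured external command."""

    def __init__(self, command: Optional[List[str]]) -> None:
        super().__init__()
        self.command = command

    def go(self) -> bool:
        # Skip the phase when no command has been configured for it
        return bool(self.command)

    def core(self) -> int:
        return subprocess.run(self.command, check=False).returncode


class ProvisionPhase(ExternalCommandPhase):
    """Illustrative concrete phase; only the identifier differs."""

    phase_id = "provision"


class UnpackPhase(TestflingerJobPhase):
    """A phase that runs agent-side code instead of an external command."""

    phase_id = "unpack"

    def __init__(self, has_attachments: bool) -> None:
        super().__init__()
        self.has_attachments = has_attachments

    def go(self) -> bool:
        return self.has_attachments

    def core(self) -> int:
        # Placeholder: the real phase would retrieve and unpack attachments
        return 0


# Usage: the external command here is just a Python one-liner that exits with 3
provision = ProvisionPhase([sys.executable, "-c", "raise SystemExit(3)"])
provision_exit_code = provision.run()
```

The point of the split is that `UnpackPhase` follows the same procedure as every other phase without having to pretend that it runs an external command.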

Some points to note while reviewing:

It makes little sense to review the "diff" for agent/testflinger_agent/job.py as the refactoring is considerable. It is best to view the file in its entirety.
There is now absolutely no attachment-related code under TestflingerAgent. It has all been moved into the newly introduced UnpackPhase. Functions/methods like unpack_attachments and secure_filter are now implemented as methods of the UnpackPhase, since this is the only phase they are relevant to.
In a similar vein, phase-specific functionality is restricted to the class representing the corresponding phase. See, for example, the wait_for_completion method or the post_core implementation for the allocate phase: these used to be methods of TestflingerJob, whereas now they are methods of AllocatePhase, which is the only phase they are relevant to.
Both jobs and job phases require access to the same bundle of parameters. This PR introduces the TestflingerJobParameters named tuple to hold this bundle. For each job, both the job and its phases store a reference to a single object of this class. This avoids duplication of this information across jobs and phases and, most importantly, allows jobs and phases to be loosely coupled, instead of one circularly referencing the other.
The rundir is no longer provided as an argument to TestflingerJob.run_test_phase. This directory is always the same for each job and can be determined when instantiating the job.
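As a rough sketch of the parameter bundle mentioned above: a job creates one `TestflingerJobParameters` object and hands the same reference to each of its phases, so phases never need a back-reference to the job. Only the name `TestflingerJobParameters` and the fact that it is a named tuple come from this PR. The field names and the simplified `Job` and `Phase` classes are assumptions made for this example.

```python
from pathlib import Path
from typing import Any, List, NamedTuple


class TestflingerJobParameters(NamedTuple):
    """Bundle shared by a job and its phases (field names are assumptions)."""

    job_data: dict
    client: Any
    rundir: Path


class Phase:
    """Simplified phase: holds the shared parameters, not the job itself."""

    def __init__(self, phase_id: str, params: TestflingerJobParameters) -> None:
        self.phase_id = phase_id
        self.params = params


class Job:
    """Simplified job: determines the rundir once, at instantiation."""

    def __init__(self, job_data: dict, client: Any, rundir_root: str) -> None:
        rundir = Path(rundir_root) / job_data["job_id"]
        self.params = TestflingerJobParameters(job_data, client, rundir)
        self.phases: List[Phase] = [
            Phase(phase_id, self.params)
            for phase_id in ("unpack", "provision", "test")
        ]


job = Job({"job_id": "abc"}, client=None, rundir_root="/tmp/runs")
```

Because the tuple is immutable and shared by reference, every phase sees the same `rundir` as the job, and nothing in `Phase` points back at `Job`.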

Resolved issues

Resolves CERTTF-412 and CERTTF-413.

Documentation

N/A

Web service API changes

N/A

Tests

boukeas commented 1 month ago

@plars @val500 I have noticed that <phase>_start and <phase>_success events are emitted (and recorded) even for phases that are subsequently skipped. Is this behaviour intentional, i.e. are there any advantages to doing this? I would argue that these events should only be emitted if the corresponding phase is actually executed.
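A minimal sketch of the behaviour I am arguing for, where a skipped phase emits nothing. The `run_phases` function, the use of `None` to mark a skipped phase, the phase names and the `_fail` suffix are assumptions made for this example and do not reflect the current agent code.

```python
from typing import Callable, Dict, List, Optional


def run_phases(phases: Dict[str, Optional[Callable[[], int]]]) -> List[str]:
    """Return the events emitted while going through the given phases."""
    events = ["job_start"]
    for name, action in phases.items():
        if action is None:
            # Skipped phase: no start or success event is emitted
            continue
        events.append(f"{name}_start")
        exit_code = action()
        suffix = "success" if exit_code == 0 else "fail"  # "fail" is assumed
        events.append(f"{name}_{suffix}")
    return events


emitted = run_phases(
    {
        "provision": lambda: 0,
        "firmware_update": None,  # nothing configured, so the phase is skipped
        "test": lambda: 0,
    }
)
```

Here `emitted` contains events for `provision` and `test` only, so anyone consuming the events can tell which phases actually ran.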

boukeas commented 1 month ago

@plars The attachment error you reproduced is fixed by this commit (part of this PR), which is mentioned in the description:

Fixes a minor issue in how tarfile is patched for Python 3.8 that allows directories to be included as attachments.

So attaching directories is supported and that issue is fixed. However, this PR also aims to surface any other attachment errors, in a manner consistent with how errors are handled in other phases. More in a separate comment.

boukeas commented 1 month ago

@plars When you tried to reproduce the error, you had no indication as a Testflinger user that there was indeed an attachment error. If you polled the Testflinger output, there was nothing there. If you requested the job results, there was no <phase>_status exit code or any other field to reflect that something failed. And if you were monitoring the events emitted, again there would be no relevant events. These are all mechanisms that a Testflinger user can rely on in order to determine the outcome of a job and its phases but they don't apply to attachment unpacking: you had to go check the agent's log in order to see the error message, which is something that a user cannot do. This is all outlined in the PR description as well.

Of course, we could handle attachment unpacking as a special case and do all that (add a result field, emit events and create a runner to generate error output) specifically for attachments. But why would we, when this is all done for each phase anyway? And, as a bonus, you get what I believe is a very sensible refactoring of the existing phases as well.

boukeas commented 1 month ago

Closing this PR as it attempts to resolve multiple issues at once:

Support for folder attachments has been resolved separately with a minor patch through #368.
The refactoring proposed in this PR will be examined separately through #371.
The issue with attachment failures not being included in the job results will be addressed in the context of CERTTF-423.
