RHSecurityCompliance / contest

Content Testing for ComplianceAsCode/content

Misc fixes - building in plan, PyYAML, datastream XML parsing #260

Closed comps closed 1 month ago

comps commented 1 month ago

(These were originally part of a new audit-sample-rules test, https://github.com/RHSecurityCompliance/contest/issues/259, but I decided to split them off for easier discussion tracking.)

Going by commits, from oldest:

add util/backup.py for backup+restore of filesystem paths

This was for an earlier version of the test, which actually remediated the audit rules into /etc/audit - I first had a VM-using test, but realized we need to run it multi-arch, so I rewrote it without VMs, hence the backup/restore.

The current version of the test doesn't use this, but I think it would still be good to have it, rather than throw the code away. The early versions of Contest also had it: https://github.com/RHSecurityCompliance/contest/blob/bb2778816c463f9148395c8e566c7724ce60b28e/lib/backup.sh :smile:
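To illustrate the idea (this is only a sketch; the actual `util/backup.py` API may look different), a minimal backup/restore helper for filesystem paths could be a context manager that copies the path aside and puts it back on exit:

```python
import shutil
import tempfile
from pathlib import Path
from contextlib import contextmanager


@contextmanager
def backed_up(path):
    """Back up 'path' (file or directory), restore it on exit.

    Illustrative sketch only; the real util/backup.py may differ.
    """
    path = Path(path)
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / path.name
        if path.is_dir():
            shutil.copytree(path, backup)
        else:
            shutil.copy2(path, backup)
        try:
            yield path
        finally:
            # restore the original contents, discarding any changes
            if path.is_dir():
                shutil.rmtree(path, ignore_errors=True)
                shutil.copytree(backup, path)
            else:
                shutil.copy2(backup, path)
```

A test could then freely remediate into `/etc/audit` inside the `with` block and rely on the original state coming back afterwards.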

start using PyYAML instead of hacking YAML syntax

From the commit message body:

    This was likely a RHEL-7 limitation, plus we didn't want to add
    an unnecessary dependency on non-python-stdlib modules (like
    'requests' or 'PyYAML').

    However we keep running into the need to parse YAML (e.g. playbooks
    or TMT metadata), and python3*-pyyaml seems to be widely available
    in RHEL (even python39-pyyaml on RHEL-8), unlike 'rpm' or 'dnf',
    so we might as well use it.

    The '/static-checks/ansible/allowed-modules' test already uses it
    without adding an RPM dependency, which hints at how widely available
    PyYAML is. Also, TMT probably depends on it too.

I specifically needed it in the audit test to parse ansible remediations and get .rules filepath / contents without having to sed/grep bash code.
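As a rough sketch of why PyYAML beats sed/grep here (the playbook below is made up; real ComplianceAsCode remediations structure their tasks differently, and the file name is illustrative):

```python
import yaml

# hypothetical remediation playbook, simplified for illustration
playbook_text = """
- hosts: all
  tasks:
    - name: Put audit rules in place
      copy:
        dest: /etc/audit/rules.d/30-example.rules
        content: |
          -a always,exit -F arch=b64 -S unlink -F auid>=1000
"""

playbook = yaml.safe_load(playbook_text)

# collect {.rules filepath: contents} without any text mangling
rules_files = {}
for play in playbook:
    for task in play.get('tasks', []):
        copy = task.get('copy')
        if copy and copy.get('dest', '').endswith('.rules'):
            rules_files[copy['dest']] = copy['content']
```

With `yaml.safe_load()`, the `.rules` path and contents come out as plain Python strings, no matter how the playbook happens to be indented or quoted.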

build CONTEST_CONTENT in a TMT plan prepare step

Per the commit:

    This is because building takes a long time (10+ minutes) on some
    slow systems and that easily exceeds a 5/10 minute 'duration'
    of a small test that never anticipated the extra delay.

    Moving it to a plan moves it outside the 'duration' count.

    Also, it's probably a better central place given that all tests
    always share the built content so the ./build_product options
    have to be considered globally.

    Finally, this is not a great fix given that we are still left
    with the 'build' argument to get_content(),

        def get_content(build=True):
            ...

    and its SRPM-based building code, which still runs within test
    'duration'; however, there is no easy solution for that (other than
    getting rid of SRPM support entirely), so this commit at least
    tries to bump things in the right direction.
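For context, the prepare-step approach might look something like this in a TMT plan (the step name here is made up and the `./build_product` options are illustrative; the real plan may differ):

```yaml
# hypothetical plan fragment
prepare:
  - name: build content
    how: shell
    # runs once per plan, outside any individual test 'duration';
    # all tests share the result, so options are chosen globally
    script: ./build_product
```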

split off xml-parsing code from class Datastream

I originally wanted to redesign the whole XML parsing in class Datastream to have a more modular "what you want" data retrieval:

class Datastream(enum.Flag):
    """
    Represent an OpenSCAP datastream XML definition and enable extracting
    selected values from it.

    This is done by the class definition having enum.Flag style variables,
    which get overwritten by a class instance to either None (if the given
    value was not chosen to be extracted) or the contents of the extracted
    data.

        E.g.
        ds = Datastream()
        ds.parse_xml('file.xml', Datastream.profiles | Datastream.rules)

        profiles = ds.profiles          # is a dict() of profile metadata
        rules = ds.rules                # is a dict() of rule metadata
        remediations = ds.remediations  # is None (was not requested)

    This is to preserve reasonable speed and low memory overhead for use
    cases that don't need all the details.

    Data structure example:

        ds.path = Path(file_the_datastream_was_parsed_from)

        ds.profiles = {
          'ospp': namespace(
            .title = 'Some text',
            .rules = {'audit_delete_success', 'service_firewalld_enabled', ...},
            .values = {('var_rekey_limit_size', '1G'), ...},
          ),
          ... (all profiles in the datastream) ...
        }

        ds.rules = {
          'configure_crypto_policy': namespace(
            .fixes = FixType.bash | FixType.ansible,
          ),
          'account_password_selinux_faillock_dir': namespace(
            .fixes = FixType.bash,
          ),
          ... (all rules in the datastream) ...
        }

        ds.remediations = {
          'configure_crypto_policy': {
            FixType.bash: '... bash code ...',
            FixType.ansible: '... ansible playbook ...',
          },
          'account_password_selinux_faillock_dir': {
            FixType.bash: '... bash code ...',
          },
          ... (all rules with at least one type of fix available) ...
        }

        (FixType is Datastream.FixType)
    """

However, mid-way through implementing it, I realized that when parsing out remediation code, we probably don't want to extract all 10s of MBs of code, just a few select rules.

So I went with what's in the commit - splitting off the batched XML parser, allowing the audit test to parse the DS XML on its own, extracting remediation code only for rules in the policy_rules Group.
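The gist of that streaming, pick-only-what-you-need parsing can be sketched like this (the XML below is heavily simplified; a real SCAP datastream uses XCCDF namespaces, prefixed rule ids and different fix `system` identifiers):

```python
import io
import xml.etree.ElementTree as ET

# stand-in for a datastream file; real XML looks quite different
xml_data = b"""
<Benchmark>
  <Rule id="configure_crypto_policy">
    <fix system="bash">update-crypto-policies --set FIPS</fix>
  </Rule>
  <Rule id="audit_rules_example">
    <fix system="bash">auditctl -w /etc/passwd -p wa</fix>
  </Rule>
</Benchmark>
"""

wanted = {'audit_rules_example'}
remediations = {}

# iterparse() streams the document element-by-element, so we only
# keep remediation text for the rules we actually asked for
for event, elem in ET.iterparse(io.BytesIO(xml_data), events=('end',)):
    if elem.tag != 'Rule':
        continue
    rule_id = elem.get('id')
    if rule_id in wanted:
        remediations[rule_id] = {
            fix.get('system'): fix.text for fix in elem.iter('fix')
        }
    elem.clear()  # drop the subtree, freeing memory for skipped rules
```

This keeps memory roughly proportional to the selected rules rather than to the full 10s-of-MBs of remediation code in the datastream.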

I'm not against revisiting the class Datastream code in the future, but (despite the big +++ and ---) nothing has really changed inside, aside from one minor bug fix (elem.get to elements[-1].get).