SUSE-Enceladus / img-proof

img-proof provides a command line utility to test images in the Public Cloud
https://img-proof.readthedocs.io/en/latest/
GNU General Public License v3.0
14 stars 7 forks source link

Do not retry hardened test if report was already generated #380

Closed ricardobranco777 closed 5 months ago

ricardobranco777 commented 5 months ago

This PR:

Rationale: We're using expensive VM's to run this particular test due to the workaround for OOM and these retries are burning money.

Related ticket: https://progress.opensuse.org/issues/158613 Verification run: https://openqa.suse.de/tests/13965566

The time for the VR to run was 22m12s instead of 38m42s in the original job. A 42,6% reduction.

asmorodskyi commented 5 months ago

LGTM

smarlowucf commented 5 months ago

Maybe we are trying to fix the problem at the wrong level? This is getting pretty complicated which will likely lead to more bugs in the test. In your test runs if you want this to run once you can pass in --retry-count 1. This avoids the problem at a lower level preventing the need for all of this complexity.

ricardobranco777 commented 5 months ago

Maybe we are trying to fix the problem at the wrong level? This is getting pretty complicated which will likely lead to more bugs in the test. In your test runs if you want this to run once you can pass in --retry-count 1. This avoids the problem at a lower level preventing the need for all of this complexity.

The problem with --retry-count 1 is that it causes false positives for transient issues making tests in openQA fail, having its own retry mechanism thus burning more money.

I can keep the PR simple. It's not that complicated.

smarlowucf commented 5 months ago

Can this specific test be run on it's own and aggregated to the full test suite?

ricardobranco777 commented 5 months ago

Can this specific test be run on it's own and aggregated to the full test suite?

It would add real complexity on our side.

This change has value on its own for your workloads as it reduces the time the whole test runs when oscap fails, saving money for SUSE and reducing our carbon footprint.

smarlowucf commented 5 months ago

So the main problem we are trying to solve is to not run this test only once? Given that the test relies on a log file to determine whether to run it it seems we can assume that the instance under test is ephemeral?

Currently the biggest problem with this test is that it will always fail. After the first run no matter what the result was if the log file exists the test gets marked pytest.fail. Instead we want to carry the result of the first test run. Thus it seems like we need to store a state variable such as "SAP_RESULT" (in the system under test). Where this could denote pass, fail, skip. That way instead of checking if the report exists we can check if there's a previous result stored then duplicate that result while skipping the test. I.e. pytest.{skip/pass/fail}.

ricardobranco777 commented 5 months ago

So the main problem we are trying to solve is to not run this test only once? Given that the test relies on a log file to determine whether to run it it seems we can assume that the instance under test is ephemeral?

I removed all references to the log file, checking instead for the report. Yes, we asumme the SUT is ephemeral.

Currently the biggest problem with this test is that it will always fail.

Not true after the latest changes. I assume pytest doesn't restart passed tests.

smarlowucf commented 5 months ago

Not true after the latest changes. I assume pytest doesn't restart passed tests.

pytest.fail is called explicitly if the report exists. https://github.com/SUSE-Enceladus/img-proof/pull/380/files#diff-ee3bc32d8d6d9cbcedc8b867e57b8f3f87dc764a02bc42f2132fe0d11c2f6eefR50

ricardobranco777 commented 5 months ago

Not true after the latest changes. I assume pytest doesn't restart passed tests.

pytest.fail is called explicitly if the report exists. https://github.com/SUSE-Enceladus/img-proof/pull/380/files#diff-ee3bc32d8d6d9cbcedc8b867e57b8f3f87dc764a02bc42f2132fe0d11c2f6eefR50

If the report exists we can safely assume it's a retry from a failed test.

smarlowucf commented 5 months ago

If the report exists we can safely assume it's a retry from a failed test.

What if the first run passes? Then there's no report and it gets re-run?

ricardobranco777 commented 5 months ago

If the report exists we can safely assume it's a retry from a failed test.

What if the first run passes? Then there's no report and it gets re-run?

If the first run passes the report is generated if SCAP_REPORT is set to a valid path. This tool or pytest isn't supposed to retry the test if it passed.

Do you mean manually run a 2nd time? In this case the report should be removed before running img-proof.

smarlowucf commented 5 months ago

Okay, sorry I was confusing the scenarios. If it passes there's a report but not a second run. So long as there's not a need to run the test suite on the same instance which could be an issue for us. Thus I think this is okay. But I may submit PR to remove this test from the default test suite if it causes us trouble. That would just mean that it would have to be included separately like img-proof test ... test_sles test_sles_hardened.

ricardobranco777 commented 5 months ago

Okay, sorry I was confusing the scenarios. If it passes there's a report but not a second run. So long as there's not a need to run the test suite on the same instance which could be an issue for us. Thus I think this is okay. But I may submit PR to remove this test from the default test suite if it causes us trouble. That would just mean that it would have to be included separately like img-proof test ... test_sles test_sles_hardened.

I don't see how it could be an issue for your team if you don't need the report... It fails only if the report already exists. AFAICT, you don't consume the report at all.

smarlowucf commented 5 months ago

Yes, not likely a problem.

ricardobranco777 commented 5 months ago

Yes, not likely a problem.

I nonetheless recommend generating the report on ephemeral instances to avoid having it run for longer when oscap fails.