checkpoint-restore / criu

Checkpoint/Restore tool
criu.org
Other
2.91k stars 583 forks source link

zdtm: Distinguish between fail and crash of dump #2376

Closed bsach64 closed 3 months ago

bsach64 commented 6 months ago

This PR distinguishes criu dump crash from criu dump fail in the zdtm.py script. Now, zdtm reports criu dump crash as a test failure.

Very similiar to the work done in this PR (#2140)

But, to detect a rpc crash we see that if the response from the RPC is invalid, then we assume the RPC crashed.

For detecting the crash for the CLI we can check the Popen.returncode https://docs.python.org/3/library/subprocess.html#subprocess.Popen.returncode

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

Fixes: #350

codecov-commenter commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.57%. Comparing base (00d7cdc) to head (6298e80).

:exclamation: Current head 6298e80 differs from pull request most recent head 4bfaf6e. Consider uploading reports for the commit 4bfaf6e to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## criu-dev #2376 +/- ## ============================================ - Coverage 70.73% 70.57% -0.16% ============================================ Files 136 135 -1 Lines 32940 33449 +509 ============================================ + Hits 23299 23606 +307 - Misses 9641 9843 +202 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Snorch commented 5 months ago

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714ccce714093170a2616474cfc7b33298c75e as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

bsach64 commented 5 months ago

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly. On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Snorch commented 5 months ago

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

bsach64 commented 5 months ago

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail. I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

Snorch commented 5 months ago

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

bsach64 commented 5 months ago

To simulate the crash, I added a call to abort() before line 2104 criu/cr-dump.c

It looks like we need to add a new fault injection there. See f1714cc as an example. This way we would be able to add a test for it to test/jenkins/criu-fault.sh with --fault <id> option.

Hello @Snorch, I added a new fault injection for CRIU crashing on dump. Please let me know if this was done correctly.

Looks correct. But from zdtm.py point of view it may be not so simple, see how __criu_act retries action without fault, we likely don't want our fault to be retry-able by zdtm, so maybe just make it's id >=128 ?

I don't think giving the fault injection an id >= 128 is what we want. If I understand correctly, ids >= 128 refer to non-fatal fault injections (criu/include/fault-injection.h). Giving this fatal fault injection an id >= 128 causes tests to fail.

I may misunderstand something, correct me if I'm wrong, but isn't a test fail exactly the behavior we expect in case of criu crash? In other words: if criu fails to dump/restore crfail test zdtm reports PASS, if criu crashes zdtm should report FAIL, isn't it? (And in criu-fault.sh we should expect FAIL from zdtm.)

I think currently zdtm retries all fatal fault injections. I will try to figure out a different way :) Can you elaborate on why we don't want our fault to be retry-able?

On adding a test(s) to test/jenkins/criu-fault.sh, any test that specifies the crfail flag can (should?) be added to the criu-fault.sh. There are ~15 tests that specify the crfail flag should all these tests be added to criu-fault.sh? Is there a way that we can run all tests that specify a specific flag?

Yeh, it's a good question. Let's just use one random test with crfail (e.g.: vfork00). I don't see easy alternative to this approach, maybe someone else does...

I have added vfork00 to criu-fault.sh.

Sorry, the misunderstanding was on my part. You are correct; the tests previously passed as they were re-run without the fault injection. Thank you for all your help @Snorch :)

Snorch commented 5 months ago

Hm, test still does not pass with && fail as though we don't call fail function we still return non-zero exit code and with set -e it fails the script.

This seem to do the right trick:

if ./test/zdtm.py run -t zdtm/static/vfork00 --fault 136 --report report -f h ; then
        fail
fi

Also maybe someone else has better ideas?

github-actions[bot] commented 4 months ago

A friendly reminder that this PR had no activity for 30 days.

bsach64 commented 3 months ago

A friendly reminder that this PR had no activity for 30 days.

Can we remove the "stale-pr" label?

adrianreber commented 3 months ago

A friendly reminder that this PR had no activity for 30 days.

Can we remove the "stale-pr" label?

Removed.