avocado-framework / avocado

Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.
https://avocado-framework.github.io/

Identify / Resolve Travis CI failures on aarch64 #4768

Closed clebergnu closed 3 years ago

clebergnu commented 3 years ago

The last known good job that passed on Travis CI for aarch64 was:

The next change that produced a failure was based on commit 7774fac86b175d608a71bb03836e8e70e7b10e3b alone, and at first glance it doesn't seem to be the culprit.

clebergnu commented 3 years ago

I've checked the distros used (which could impact the environment between jobs) and the only relevant difference seems to be on the kernel:

Passing job: Runtime kernel version: 5.4.0-40-generic
Failing jobs: Runtime kernel version: 5.8.0-59-generic

clebergnu commented 3 years ago

I've also run the selftests on a Fedora 33 aarch64 machine with kernel 5.12.13-200.fc33.aarch64 without issues. Because there's a "Killed" message in the Travis CI job, I've monitored resource consumption of the Avocado process, and it seemed normal.

ana commented 3 years ago

> I've also run the selftests on a Fedora 33 aarch64 machine with kernel 5.12.13-200.fc33.aarch64 without issues. Because there's a "Killed" message in the Travis CI job, I've monitored resource consumption of the Avocado process, and it seemed normal.

FWIW, same behavior in Fedora 34 with 5.12.13-300.fc34.aarch64

clebergnu commented 3 years ago

A number of ideas came up:

  1. Set the "allow failure" mode on that Travis CI job for now
  2. Check/enable artifacts on Travis so that we can have access to the full job logs
  3. Run the tests serially
  4. Skip the tests running close to the KILLED message

ana commented 3 years ago

I've ended up doing further debugging on this one. My experiment is in this draft PR: https://github.com/avocado-framework/avocado/pull/4773

After making the selftests more granular, I have found issues with two of them: selftests/functional/test_basic.py and selftests/functional/plugin/test_logs.py.

If you check the logs on s390x, you'll see they both exit with 1:

In the arm64 log, test_logs.py fails (https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524217658#L3110) while test_basic.py is killed (maybe because it fails and interferes with the Travis functions?).

While experimenting before arriving at this, I often saw that the random failure happened after running test_basic.py, which somehow seemed to break the following test.

ana commented 3 years ago

Following my previous comment, locally:

  [anguerre@anguerre functional]$ avocado run plugin/test_logs.py 
  JOB ID     : cbac50ee690f358da9a01d1a34d68ff9e27aa101
  JOB LOG    : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-cbac50e/job.log
   (1/3) test_logs.TestLogsUI.test: PASS (1.12 s)
   (2/3) test_logs.TestLogsFilesUI.test_simpletest_logfiles: PASS (0.80 s)
   (3/3) test_logs.TestLogging.test_job_log: PASS (1.80 s)
  RESULTS    : PASS 3 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
  JOB HTML   : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-cbac50e/results.html
  JOB TIME   : 5.58 s
  [anguerre@anguerre functional]$ echo $?
  0
  [anguerre@anguerre functional]$ avocado run --test-runner=nrunner   plugin/test_logs.py 
  JOB ID     : be260f58152f5207d9be0cb832f3a3a6adb327d8
  JOB LOG    : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-be260f5/job.log
   (1/3) plugin/test_logs.py:TestLogsUI.test: STARTED
   (3/3) plugin/test_logs.py:TestLogging.test_job_log: STARTED
   (2/3) plugin/test_logs.py:TestLogsFilesUI.test_simpletest_logfiles: STARTED
   (2/3) plugin/test_logs.py:TestLogsFilesUI.test_simpletest_logfiles: PASS (0.60 s)
   (1/3) plugin/test_logs.py:TestLogsUI.test: PASS (0.86 s)
  RESULTS    : PASS 2 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
  JOB HTML   : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-be260f5/results.html
  JOB TIME   : 4.06 s
  [anguerre@anguerre functional]$ echo $?
  1

The same happens with test_basic.py.

richtja commented 3 years ago

With the results from @ana, I think I found the problem. The nrunner has a problem with calling another avocado nrunner process inside a test. You can reproduce it with:

from avocado import Test
from avocado.utils import process

class TestPom(Test):

    def test_pom(self):
        cmd_line = 'avocado run --test-runner=nrunner /bin/true'
        process.run(cmd_line)

results are:

avocado run --test-runner=nrunner /tmp/test.py 
JOB ID     : 2562848d3aaad1dd1d5272033bc7a335faac1fae
JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.49-2562848/job.log
 (1/1) /tmp/test.py:TestPom.test_pom: STARTED
RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/jarichte/avocado/job-results/job-2021-07-14T14.49-2562848/results.html
JOB TIME   : 4.11 s

results with printed messages:

avocado run --test-runner=nrunner /tmp/test.py 
JOB ID     : 5eabdadc86621bee8a22bb98f5f4cf996948e1a5
JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-5eabdad/job.log
{'status': 'started', 'time': 83602.972482151, 'output_dir': '/var/tmp/.avocado-task-y4u5m3jz'}
 (1/1) /tmp/test.py:TestPom.test_pom: STARTED
{'type': 'log', 'log': b'2021-07-14 14:48:37,281 test             L0303 INFO | INIT 1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.979312157}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 parameters       L0142 DEBUG| PARAMS (key=timeout, path=*, default=None) => None', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.979877181}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0333 DEBUG| Test metadata:', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980035452}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0335 DEBUG|   filename: /tmp/test.py', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980136853}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0341 DEBUG|   teststmpdir: /tmp/avocado_fhieqbb3', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980202224}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0342 DEBUG|   workdir: /var/tmp/.avocado-taskn_zsb_k0/test-results/tmp_dir4kcg89by/1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980248834}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0520 INFO | START 1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.98053873}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0205 DEBUG| DATA (filename=output.expected) => NOT FOUND (data sources: variant, test, file)', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980639889}
{'type': 'log', 'log': b"2021-07-14 14:48:37,283 process          L0649 INFO | Running 'avocado run --test-runner=nrunner /bin/true'", 'encoding': 'utf-8', 'status': 'running', 'time': 83602.981155059}
{'status': 'running', 'time': 83603.089675691}
{'status': 'running', 'time': 83603.596106721}
{'status': 'running', 'time': 83604.102729578}
{'type': 'log', 'log': b'2021-07-14 14:48:38,495 process          L0461 DEBUG| [stdout] JOB ID     : 85612d4547585c89eebd7aca9475bb53022b9f09', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193480509}
{'type': 'log', 'log': b'2021-07-14 14:48:38,495 process          L0463 DEBUG| JOB ID     : 85612d4547585c89eebd7aca9475bb53022b9f09\n', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193699222}
{'type': 'log', 'log': b'2021-07-14 14:48:38,496 process          L0461 DEBUG| [stdout] JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-85612d4/job.log', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193769345}
{'type': 'log', 'log': b'2021-07-14 14:48:38,496 process          L0463 DEBUG| JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-85612d4/job.log\n', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193842971}
{'status': 'running', 'time': 83604.61016965}
{'status': 'started', 'time': 83605.073994736, 'output_dir': '/var/tmp/.avocado-task-d2y_rdv4'}
RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-5eabdad/results.html
JOB TIME   : 4.13 s

I think this might be related to #4308.

willianrampazzo commented 3 years ago

> With the results from @ana, I think I found the problem. The nrunner has a problem with calling another avocado nrunner process inside a test. [...]

Isn't it because of the conflicting status server?
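If the conflicting-status-server hypothesis is right, the class of problem can be illustrated outside of Avocado: a status server that listens on one fixed TCP address will fail to come up when a nested job tries to bind the same address. This is a minimal sketch using plain Python sockets, not Avocado's actual status-server code:

```python
import errno
import socket

# Illustrative sketch (not Avocado code): a status server bound to a fixed
# TCP address clashes when a nested job tries to bind the same address.
def bind_status_server(host, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    s.listen()
    return s

outer = bind_status_server("127.0.0.1", 0)     # outer job's status server
port = outer.getsockname()[1]                  # the port it actually got
try:
    inner = bind_status_server("127.0.0.1", port)  # nested job, same port
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print("nested status server failed to bind: address already in use")
outer.close()
```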

richtja commented 3 years ago

@willianrampazzo I did some more tests and I think you are right. I will fix those tests, and we will see if it helps with Travis failures.

richtja commented 3 years ago

@ana @clebergnu @willianrampazzo unfortunately the fixes from #4776 didn't help; the problem has to be elsewhere.

richtja commented 3 years ago

I did some more tests with @ana's experiment and found out that when the test_basic.py tests are run in serial mode, everything is fine: https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524481481#L3266

willianrampazzo commented 3 years ago

> I did some more tests with @ana's experiment and found out that when the test_basic.py tests are run in serial mode, everything is fine: https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524481481#L3266

That is a good indication that there are still tests colliding with the status server.

richtja commented 3 years ago

@willianrampazzo IMO it is not related to the status server issue, because I was able to run it with --nrunner-max-parallel-tasks=6. I think I found the explanation in the Travis documentation: when its resources are exhausted, Travis will kill the job. I think the problem is that the Travis arm machine is able to run 100 processes, but test_basic.py has 76 tests, each of which also runs its own process. That is 152 processes, and the resources are exhausted.
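The arithmetic above can be sketched as a quick sanity check. The helper below is hypothetical and only assumes that each running test contributes one task process plus the subprocess it spawns:

```python
# Hypothetical helper illustrating the process-count arithmetic:
# each running test is one task process plus the subprocess it spawns.
def worst_case_processes(parallel_tests, procs_per_test=2):
    return parallel_tests * procs_per_test

# All 76 test_basic.py tests at once exceed a 100-process limit...
print(worst_case_processes(76))  # 152
# ...while capping parallelism keeps the count well under the limit,
# consistent with --nrunner-max-parallel-tasks=6 working.
print(worst_case_processes(6))   # 12
```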

willianrampazzo commented 3 years ago

> @willianrampazzo IMO it is not related to the status server issue, because I was able to run it with --nrunner-max-parallel-tasks=6. I think I found the explanation in the Travis documentation: when its resources are exhausted, Travis will kill the job. I think the problem is that the Travis arm machine is able to run 100 processes, but test_basic.py has 76 tests, each of which also runs its own process. That is 152 processes, and the resources are exhausted.

Ah, nice! That explains a lot.

ana commented 3 years ago

Mixing what I was doing in https://github.com/avocado-framework/avocado/pull/4773 and what @richtja was doing in https://github.com/avocado-framework/avocado/pull/4778, I have submitted a draft PR: https://github.com/avocado-framework/avocado/pull/4779 So, two questions:

  1. How do we want to solve this issue? Like in the draft PR? Or do we just run all the tests with --half-cpus (or another name we choose)? Or do we limit this only on arm64?
  2. Why does running test_basic.py and test_logs.py locally exit with 1?

richtja commented 3 years ago
> 2. Why does running `test_basic.py` and `test_logs.py` locally exit with 1?

Actually, this is not related to the Travis issue but to #4308 and collisions with the status server. I fixed the problematic tests in #4776, but we still need to solve the #4308 issue.

ana commented 3 years ago

For this issue, I think the best approach is https://github.com/avocado-framework/avocado/pull/4782. I'd like to add some of the --disable-... options for selftests/check.py that I made in the draft PR, but that should be part of a different PR, and I don't think we should add options like --half-cpus or --parallel-tasks only for this.
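For context, disable-style options like the ones mentioned can be sketched with argparse. The flag and group names below are illustrative, not the actual check.py interface:

```python
import argparse

# Hypothetical sketch of per-group skip flags for a selftest driver like
# selftests/check.py. Group names here are illustrative, not from the PR.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="run avocado selftests")
    for group in ("static-checks", "unit", "functional"):
        parser.add_argument(f"--disable-{group}", action="store_true",
                            help=f"skip the {group} selftests")
    return parser.parse_args(argv)

args = parse_args(["--disable-functional"])
print(args.disable_functional)  # True
```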