avocado-framework / avocado

Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.
https://avocado-framework.github.io/

Identify / Resolve Travis CI failures on aarch64 #4768

Closed clebergnu closed 3 years ago

clebergnu commented 3 years ago

The last known good job that passed on Travis CI for aarch64 was:

The next change that produced a failure was based on commit 7774fac86b175d608a71bb03836e8e70e7b10e3b alone, and at first glance it doesn't seem to be the culprit.

clebergnu commented 3 years ago

I've checked the distros used (which could impact the environment between jobs) and the only relevant difference seems to be on the kernel:

Passing job: Runtime kernel version: 5.4.0-40-generic
Failing jobs: Runtime kernel version: 5.8.0-59-generic

clebergnu commented 3 years ago

I've also run the selftests on a Fedora 33 aarch64 machine with kernel 5.12.13-200.fc33.aarch64 without issues. Because there's a "Killed" message in the Travis CI job, I've monitored resource consumption of the Avocado process, and it seemed normal.

ana commented 3 years ago

> I've also run the selftests on a Fedora 33 aarch64 machine with kernel 5.12.13-200.fc33.aarch64 without issues. Because there's a "Killed" message in the Travis CI job, I've monitored resource consumption of the Avocado process, and it seemed normal.

FWIW, same behavior in Fedora 34 with 5.12.13-300.fc34.aarch64

clebergnu commented 3 years ago

A number of ideas came up:

  1. Set the "allow failure" mode on that Travis CI job for now
  2. Check/enable artifacts on Travis so that we can have access to the full job logs
  3. Run the tests serially
  4. Skip the tests running close to the KILLED message

ana commented 3 years ago

I've ended up doing further debugging on this one. My experiment is in this draft PR: https://github.com/avocado-framework/avocado/pull/4773

After making the selftests more granular, I have found issues with two of them: selftests/functional/test_basic.py and selftests/functional/plugin/test_logs.py.

If you check the logs on s390x, you'll see they both exit with 1:

In the arm64 log, test_logs.py fails (https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524217658#L3110) while test_basic.py is killed (maybe because it fails and interferes with the Travis functions?).

While experimenting before arriving at this, I often saw that the random failure happened after running test_basic.py, which somehow seemed to break the following test.

ana commented 3 years ago

Following my previous comment, locally:

  [anguerre@anguerre functional]$ avocado run plugin/test_logs.py 
  JOB ID     : cbac50ee690f358da9a01d1a34d68ff9e27aa101
  JOB LOG    : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-cbac50e/job.log
   (1/3) test_logs.TestLogsUI.test: PASS (1.12 s)
   (2/3) test_logs.TestLogsFilesUI.test_simpletest_logfiles: PASS (0.80 s)
   (3/3) test_logs.TestLogging.test_job_log: PASS (1.80 s)
  RESULTS    : PASS 3 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
  JOB HTML   : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-cbac50e/results.html
  JOB TIME   : 5.58 s
  [anguerre@anguerre functional]$ echo $?
  0
  [anguerre@anguerre functional]$ avocado run --test-runner=nrunner   plugin/test_logs.py 
  JOB ID     : be260f58152f5207d9be0cb832f3a3a6adb327d8
  JOB LOG    : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-be260f5/job.log
   (1/3) plugin/test_logs.py:TestLogsUI.test: STARTED
   (3/3) plugin/test_logs.py:TestLogging.test_job_log: STARTED
   (2/3) plugin/test_logs.py:TestLogsFilesUI.test_simpletest_logfiles: STARTED
   (2/3) plugin/test_logs.py:TestLogsFilesUI.test_simpletest_logfiles: PASS (0.60 s)
   (1/3) plugin/test_logs.py:TestLogsUI.test: PASS (0.86 s)
  RESULTS    : PASS 2 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
  JOB HTML   : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-be260f5/results.html
  JOB TIME   : 4.06 s
  [anguerre@anguerre functional]$ echo $?
  1

The same happens with test_basic.py.

richtja commented 3 years ago

With the results from @ana, I think I found the problem. The nrunner has a problem with calling another avocado nrunner process inside a test. You can reproduce it with:

from avocado import Test
from avocado.utils import process

class TestPom(Test):

    def test_pom(self):
        cmd_line = 'avocado run --test-runner=nrunner /bin/true'
        process.run(cmd_line)

results are:

avocado run --test-runner=nrunner /tmp/test.py 
JOB ID     : 2562848d3aaad1dd1d5272033bc7a335faac1fae
JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.49-2562848/job.log
 (1/1) /tmp/test.py:TestPom.test_pom: STARTED
RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/jarichte/avocado/job-results/job-2021-07-14T14.49-2562848/results.html
JOB TIME   : 4.11 s

results with printed messages:

avocado run --test-runner=nrunner /tmp/test.py 
JOB ID     : 5eabdadc86621bee8a22bb98f5f4cf996948e1a5
JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-5eabdad/job.log
{'status': 'started', 'time': 83602.972482151, 'output_dir': '/var/tmp/.avocado-task-y4u5m3jz'}
 (1/1) /tmp/test.py:TestPom.test_pom: STARTED
{'type': 'log', 'log': b'2021-07-14 14:48:37,281 test             L0303 INFO | INIT 1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.979312157}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 parameters       L0142 DEBUG| PARAMS (key=timeout, path=*, default=None) => None', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.979877181}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0333 DEBUG| Test metadata:', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980035452}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0335 DEBUG|   filename: /tmp/test.py', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980136853}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0341 DEBUG|   teststmpdir: /tmp/avocado_fhieqbb3', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980202224}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0342 DEBUG|   workdir: /var/tmp/.avocado-taskn_zsb_k0/test-results/tmp_dir4kcg89by/1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980248834}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0520 INFO | START 1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.98053873}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test             L0205 DEBUG| DATA (filename=output.expected) => NOT FOUND (data sources: variant, test, file)', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980639889}
{'type': 'log', 'log': b"2021-07-14 14:48:37,283 process          L0649 INFO | Running 'avocado run --test-runner=nrunner /bin/true'", 'encoding': 'utf-8', 'status': 'running', 'time': 83602.981155059}
{'status': 'running', 'time': 83603.089675691}
{'status': 'running', 'time': 83603.596106721}
{'status': 'running', 'time': 83604.102729578}
{'type': 'log', 'log': b'2021-07-14 14:48:38,495 process          L0461 DEBUG| [stdout] JOB ID     : 85612d4547585c89eebd7aca9475bb53022b9f09', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193480509}
{'type': 'log', 'log': b'2021-07-14 14:48:38,495 process          L0463 DEBUG| JOB ID     : 85612d4547585c89eebd7aca9475bb53022b9f09\n', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193699222}
{'type': 'log', 'log': b'2021-07-14 14:48:38,496 process          L0461 DEBUG| [stdout] JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-85612d4/job.log', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193769345}
{'type': 'log', 'log': b'2021-07-14 14:48:38,496 process          L0463 DEBUG| JOB LOG    : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-85612d4/job.log\n', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193842971}
{'status': 'running', 'time': 83604.61016965}
{'status': 'started', 'time': 83605.073994736, 'output_dir': '/var/tmp/.avocado-task-d2y_rdv4'}
RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-5eabdad/results.html
JOB TIME   : 4.13 s

I think this might be related to #4308.

willianrampazzo commented 3 years ago

> With the results from @ana, I think I found the problem. The nrunner has a problem with calling another avocado nrunner process inside a test. [...]

Isn't it because of the conflicting status server?
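If the conflicting-status-server hypothesis is right, the class of problem can be illustrated outside of Avocado: a status server that listens on one fixed TCP address will fail to come up when a nested job tries to bind the same address. This is a minimal sketch using plain Python sockets, not Avocado's actual status-server code:

```python
import errno
import socket

# Illustrative sketch (not Avocado code): a status server bound to a fixed
# TCP address clashes when a nested job tries to bind the same address.
def bind_status_server(host, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    s.listen()
    return s

outer = bind_status_server("127.0.0.1", 0)     # outer job's status server
port = outer.getsockname()[1]                  # the port it actually got
try:
    inner = bind_status_server("127.0.0.1", port)  # nested job, same port
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print("nested status server failed to bind: address already in use")
outer.close()
```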

richtja commented 3 years ago

@willianrampazzo I did some more tests and I think you are right. I will fix those tests, and we will see if it helps with Travis failures.

richtja commented 3 years ago

@ana @clebergnu @willianrampazzo unfortunately the fixes from #4776 didn't help; the problem has to be elsewhere.

richtja commented 3 years ago

I did some more tests with @ana's experiment and found out that when the test_basic.py tests are run in serial mode, everything is fine: https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524481481#L3266

willianrampazzo commented 3 years ago

> I did some more tests with @ana's experiment and found out that when the test_basic.py tests are run in serial mode, everything is fine: https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524481481#L3266

That is a good indication that there are still tests colliding with the status server.

richtja commented 3 years ago

@willianrampazzo IMO it is not related to the status server issue, because I was able to run it with --nrunner-max-parallel-tasks=6. I think I found the explanation in the Travis documentation: when its resources are exhausted, Travis will kill the job. I think the problem is that the Travis arm machine is able to run 100 processes, but test_basic.py has 76 tests, each of which also runs its own process. That is 152 processes, and the resources are exhausted.
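The arithmetic above can be sketched as a quick sanity check. The helper below is hypothetical and only assumes that each running test contributes one task process plus the subprocess it spawns:

```python
# Hypothetical helper illustrating the process-count arithmetic:
# each running test is one task process plus the subprocess it spawns.
def worst_case_processes(parallel_tests, procs_per_test=2):
    return parallel_tests * procs_per_test

# All 76 test_basic.py tests at once exceed a 100-process limit...
print(worst_case_processes(76))  # 152
# ...while capping parallelism keeps the count well under the limit,
# consistent with --nrunner-max-parallel-tasks=6 working.
print(worst_case_processes(6))   # 12
```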

willianrampazzo commented 3 years ago

> @willianrampazzo IMO it is not related to the status server issue, because I was able to run it with --nrunner-max-parallel-tasks=6. I think I found the explanation in the Travis documentation: when its resources are exhausted, Travis will kill the job. I think the problem is that the Travis arm machine is able to run 100 processes, but test_basic.py has 76 tests, each of which also runs its own process. That is 152 processes, and the resources are exhausted.

Ah, nice! That explains a lot.

ana commented 3 years ago

Mixing what I was doing in https://github.com/avocado-framework/avocado/pull/4773 and what @richtja was doing in https://github.com/avocado-framework/avocado/pull/4778, I have submitted a draft PR: https://github.com/avocado-framework/avocado/pull/4779 So, two questions:

  1. How do we want to solve this issue? Like in the draft PR? Or do we just run all the tests with --half-cpus (or another name we choose)? Or do we limit this only on arm64?
  2. Why does running test_basic.py and test_logs.py locally exit with 1?

richtja commented 3 years ago
> 2. Why does running `test_basic.py` and `test_logs.py` locally exit with 1?

Actually, this is not related to the Travis issue but to #4308 and collisions with the status server. I fixed the problematic tests in #4776, but we still need to solve the #4308 issue.

ana commented 3 years ago

For this issue, I think the best approach is https://github.com/avocado-framework/avocado/pull/4782. I'd like to add some of the --disable-... options for selftests/check.py that I made in the draft PR, but that should be part of a different PR, and I don't think we should add options like --half-cpus or --parallel-tasks only for this.
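For context, disable-style options like the ones mentioned can be sketched with argparse. The flag and group names below are illustrative, not the actual check.py interface:

```python
import argparse

# Hypothetical sketch of per-group skip flags for a selftest driver like
# selftests/check.py. Group names here are illustrative, not from the PR.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="run avocado selftests")
    for group in ("static-checks", "unit", "functional"):
        parser.add_argument(f"--disable-{group}", action="store_true",
                            help=f"skip the {group} selftests")
    return parser.parse_args(argv)

args = parse_args(["--disable-functional"])
print(args.disable_functional)  # True
```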