Closed by clebergnu 3 years ago
I've checked the distros used (which could impact the environment between jobs) and the only relevant difference seems to be on the kernel:
Passing job: Runtime kernel version: 5.4.0-40-generic
Failing jobs: Runtime kernel version: 5.8.0-59-generic
I also ran the selftests on a Fedora 33 aarch64 machine with kernel 5.12.13-200.fc33.aarch64
without issues. Because there's a "Killed" message on the Travis-CI job, I monitored resource consumption of the Avocado process and it seemed normal.
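The kind of resource monitoring mentioned above can be sketched as follows. This is an illustrative, Linux-only snippet (it reads /proc), not the actual tooling used in the investigation:

```python
import os


def rss_kb(pid: int) -> int:
    """Return the resident set size (VmRSS) of `pid` in kB,
    read from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


# Sample our own process as a demonstration; in practice one would
# poll the Avocado process's PID in a loop while the job runs.
print("own RSS (kB):", rss_kb(os.getpid()))
```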
FWIW, same behavior in Fedora 34 with 5.12.13-300.fc34.aarch64
A number of ideas came up around the "KILLED" message. I've ended up doing further debugging on this one; my experiment is in this draft PR: https://github.com/avocado-framework/avocado/pull/4773
After making the selftests more granular, I have found there are issues with two of them: selftests/functional/test_basic.py and selftests/functional/plugin/test_logs.py.
If you check the logs on s390x, you'll see they both exit with 1.
In the arm64 log, test_logs.py fails (https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524217658#L3110), while test_basic.py is killed (maybe because it fails and messes with Travis functions?).
While experimenting until I arrived at this, I often saw that the random failure happened after running test_basic.py, because it somehow seemed to break the following test.
Following my previous comment, locally:
[anguerre@anguerre functional]$ avocado run plugin/test_logs.py
JOB ID : cbac50ee690f358da9a01d1a34d68ff9e27aa101
JOB LOG : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-cbac50e/job.log
(1/3) test_logs.TestLogsUI.test: PASS (1.12 s)
(2/3) test_logs.TestLogsFilesUI.test_simpletest_logfiles: PASS (0.80 s)
(3/3) test_logs.TestLogging.test_job_log: PASS (1.80 s)
RESULTS : PASS 3 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-cbac50e/results.html
JOB TIME : 5.58 s
[anguerre@anguerre functional]$ echo $?
0
[anguerre@anguerre functional]$ avocado run --test-runner=nrunner plugin/test_logs.py
JOB ID : be260f58152f5207d9be0cb832f3a3a6adb327d8
JOB LOG : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-be260f5/job.log
(1/3) plugin/test_logs.py:TestLogsUI.test: STARTED
(3/3) plugin/test_logs.py:TestLogging.test_job_log: STARTED
(2/3) plugin/test_logs.py:TestLogsFilesUI.test_simpletest_logfiles: STARTED
(2/3) plugin/test_logs.py:TestLogsFilesUI.test_simpletest_logfiles: PASS (0.60 s)
(1/3) plugin/test_logs.py:TestLogsUI.test: PASS (0.86 s)
RESULTS : PASS 2 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML : /home/anguerre/avocado/job-results/job-2021-07-13T21.28-be260f5/results.html
JOB TIME : 4.06 s
[anguerre@anguerre functional]$ echo $?
1
Same with test_basic.py
With the results from @ana, I think I found the problem. The nrunner has a problem with calling another avocado nrunner process inside a test. You can reproduce it with:
from avocado import Test
from avocado.utils import process


class TestPom(Test):
    def test_pom(self):
        cmd_line = 'avocado run --test-runner=nrunner /bin/true'
        process.run(cmd_line)
results are:
avocado run --test-runner=nrunner /tmp/test.py
JOB ID : 2562848d3aaad1dd1d5272033bc7a335faac1fae
JOB LOG : /home/jarichte/avocado/job-results/job-2021-07-14T14.49-2562848/job.log
(1/1) /tmp/test.py:TestPom.test_pom: STARTED
RESULTS : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML : /home/jarichte/avocado/job-results/job-2021-07-14T14.49-2562848/results.html
JOB TIME : 4.11 s
results with printed messages:
avocado run --test-runner=nrunner /tmp/test.py
JOB ID : 5eabdadc86621bee8a22bb98f5f4cf996948e1a5
JOB LOG : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-5eabdad/job.log
{'status': 'started', 'time': 83602.972482151, 'output_dir': '/var/tmp/.avocado-task-y4u5m3jz'}
(1/1) /tmp/test.py:TestPom.test_pom: STARTED
{'type': 'log', 'log': b'2021-07-14 14:48:37,281 test L0303 INFO | INIT 1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.979312157}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 parameters L0142 DEBUG| PARAMS (key=timeout, path=*, default=None) => None', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.979877181}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test L0333 DEBUG| Test metadata:', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980035452}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test L0335 DEBUG| filename: /tmp/test.py', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980136853}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test L0341 DEBUG| teststmpdir: /tmp/avocado_fhieqbb3', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980202224}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test L0342 DEBUG| workdir: /var/tmp/.avocado-taskn_zsb_k0/test-results/tmp_dir4kcg89by/1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980248834}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test L0520 INFO | START 1-TestPom.test_pom', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.98053873}
{'type': 'log', 'log': b'2021-07-14 14:48:37,282 test L0205 DEBUG| DATA (filename=output.expected) => NOT FOUND (data sources: variant, test, file)', 'encoding': 'utf-8', 'status': 'running', 'time': 83602.980639889}
{'type': 'log', 'log': b"2021-07-14 14:48:37,283 process L0649 INFO | Running 'avocado run --test-runner=nrunner /bin/true'", 'encoding': 'utf-8', 'status': 'running', 'time': 83602.981155059}
{'status': 'running', 'time': 83603.089675691}
{'status': 'running', 'time': 83603.596106721}
{'status': 'running', 'time': 83604.102729578}
{'type': 'log', 'log': b'2021-07-14 14:48:38,495 process L0461 DEBUG| [stdout] JOB ID : 85612d4547585c89eebd7aca9475bb53022b9f09', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193480509}
{'type': 'log', 'log': b'2021-07-14 14:48:38,495 process L0463 DEBUG| JOB ID : 85612d4547585c89eebd7aca9475bb53022b9f09\n', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193699222}
{'type': 'log', 'log': b'2021-07-14 14:48:38,496 process L0461 DEBUG| [stdout] JOB LOG : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-85612d4/job.log', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193769345}
{'type': 'log', 'log': b'2021-07-14 14:48:38,496 process L0463 DEBUG| JOB LOG : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-85612d4/job.log\n', 'encoding': 'utf-8', 'status': 'running', 'time': 83604.193842971}
{'status': 'running', 'time': 83604.61016965}
{'status': 'started', 'time': 83605.073994736, 'output_dir': '/var/tmp/.avocado-task-d2y_rdv4'}
RESULTS : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML : /home/jarichte/avocado/job-results/job-2021-07-14T14.48-5eabdad/results.html
JOB TIME : 4.13 s
I think this might be related to the #4308
Isn't it because of the conflicting status server?
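The kind of conflict being suggested here can be illustrated with a minimal, self-contained sketch: two processes trying to bind the same fixed address, the way two nrunner jobs would fight over one status-server URI. The addresses and error handling are illustrative, not Avocado's actual defaults:

```python
import socket

# A parent "status server" binds a localhost address. Port 0 lets the
# OS pick a free port, so this sketch is safe to run anywhere.
parent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
parent.bind(("127.0.0.1", 0))
parent.listen()
port = parent.getsockname()[1]

# A nested job that tries to bind the very same address fails with
# EADDRINUSE, mirroring a nested nrunner colliding with its parent.
child = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
err = None
try:
    child.bind(("127.0.0.1", port))
except OSError as exc:
    err = exc.errno
finally:
    child.close()
    parent.close()

print("nested bind failed with errno:", err)
```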
@willianrampazzo I did some more tests and I think you are right. I will fix those tests, and we will see if it helps with Travis failures.
@ana @clebergnu @willianrampazzo unfortunately the fixes from #4776 didn't help; the problem has to be elsewhere.
I did some more tests with @ana's experiment and I found out that when the test_basic.py tests are run in serial mode, everything is fine.
https://app.travis-ci.com/github/avocado-framework/avocado/jobs/524481481#L3266
That is a good suggestion that there are still tests colliding with the status server.
@willianrampazzo IMO it is not related to the status server issue, because I was able to run it with --nrunner-max-parallel-tasks=6. I think I found the problem in the Travis documentation: when resources are exhausted, Travis will kill the testing. I think the problem is that the Travis arm machine is able to run 100 processes, but test_basic.py has 76 tests which also run their own processes. That is 152 processes, and the resources are exhausted.
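The exhaustion arithmetic can be made explicit. The process cap and test count come from the comment above; the assumption that each test accounts for two processes (the test itself plus the avocado process it spawns) is an illustrative simplification:

```python
# Back-of-envelope check of the process-exhaustion hypothesis.
PROCESS_CAP = 100       # processes the Travis arm machine allows (from the thread)
CONCURRENT_TESTS = 76   # tests in test_basic.py running in parallel (from the thread)
PROCS_PER_TEST = 2      # assumed: each test plus the avocado process it starts

needed = CONCURRENT_TESTS * PROCS_PER_TEST
print(f"needed={needed}, cap={PROCESS_CAP}, exhausted={needed > PROCESS_CAP}")
```

Capping parallelism (e.g. with --nrunner-max-parallel-tasks) shrinks CONCURRENT_TESTS, which is why the serial run succeeds.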
Ah, nice! That explains a lot.
Mixing what I was doing in https://github.com/avocado-framework/avocado/pull/4773 and what @richtja was doing in https://github.com/avocado-framework/avocado/pull/4778, I have submitted a draft PR: https://github.com/avocado-framework/avocado/pull/4779. So, two questions:
1. Should we add an option such as --half-cpus (or another name we choose)? Or do we only limit this on arm64?
2. Why, when running test_basic.py and test_logs.py locally, does it exit with 1?
Actually, this is not related to the Travis issue but to #4308 and collisions with the status server. I fixed the problematic tests in #4776, but we still need to solve the #4308 issue.
For this issue, I think the best approach is https://github.com/avocado-framework/avocado/pull/4782
I'd like to add some of the --disable-... options for selfcheck/check.py that I made in the draft PR, which should be part of a different PR, and I don't think we should add options like --half-cpus or --parallel-tasks only for this.
The last known good job that passed on Travis CI for aarch64 was:
The next change that produced a failure was based on commit 7774fac86b175d608a71bb03836e8e70e7b10e3b alone, and at first, it doesn't seem to be the culprit.