Closed dlezcano closed 6 years ago
Hi,
I have not come across that issue before, do you get that same error if you run that command multiple times yourself? / Do you get this with multiple devices?
Yes, I get the issue if I run the command multiple times. Yes, this issue has been reported with another device but same platform (hikey960)
Perhaps, this is related to the error I can see in the console: [ 1192.961243] binder_alloc: 13020: binder_alloc_buf, no vma [ 1192.961243] binder_alloc: 13020: binder_alloc_buf, no vma [ 1192.972464] binder: 2724:10218 transaction failed 29189/-3, size 108-0 line 3135 [ 1192.972464] binder: 2724:10218 transaction failed 29189/-3, size 108-0 line 3135 [ 1193.014989] binder: undelivered transaction 338114, process died. [ 1193.014989] binder: undelivered transaction 338114, process died. [ 1193.031581] binder: undelivered transaction 338093, process died. [ 1193.031581] binder: undelivered transaction 338093, process died.
Is that with the the same image on the different devices? The other crashes in the console likely indicate that there is something else wrong with the Android system rather than a WA issue itself.
Yes, it is with the same image. Agree it must be an Android related issue. Closing the issue.
It seems like this error can happen when there are multiple instances of logcat running at the same time. I'm wondering if WA could have miss a 'wait' after exec'ing the adb logcat -c (or may be adb is closing the connection before waiting the child process?). I'm reopening this issue in order to double check if that can be the case from the WA side.
The actual clearing of logcat on the device is performed via devlib (https://github.com/ARM-software/devlib/blob/master/devlib/target.py#L1386) and the implementation there uses a lock to prevent multiple instances and will wait until the command is completed to check the error code so I think these cases should already be guarded against.
When you say the error occurs with multiple instances of logcat running, do you still experience it if you are not also monitoring logcat yourself?
Sorry I missed your answer. I reproduced the issue and monitored a 'logcat' instance. There is a: shell 28556 24458 0 08:36:59 pts/1 00:00:00 logcat
stuck on the 'recvmsg' syscall on a socket.
I verified before running the scenario no instance was running. So somehow, the logcat is blocked on the socket and WA3 does not wait it (or does not return an error if a timeout occurs leading the error for the next run).
Agenda triggering the issue:
sections:
workloads:
Thank you very much for your agenda, I managed to reproduce the issue here and the jankbench monitor was clearing logcat itself rather than using the method available in devlib.
Would you be able to try my fix (https://github.com/ARM-software/workload-automation/pull/900) and see if this fixes your issue?
Yes, I confirm I do no longer see the error now. Thanks !
Cool thank you, I've merged the fix into master.
Hi. I do see this problem a lot too. I did take the latest fixes but logcat fails with this error message more than 1/3rd of the time
ERROR File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup
ERROR self.workload.setup(context)
ERROR File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 104, in setup
ERROR self.monitor = JankbenchRunMonitor(self.target)
ERROR File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 211, in __init__
ERROR self.target.clear_logcat()
ERROR File "/home/user/src/lisa/libs/devlib/devlib/target.py", line 1387, in clear_logcat
ERROR adb_command(self.adb_name, 'logcat -c', timeout=30)
ERROR File "/home/user/src/lisa/libs/devlib/devlib/utils/android.py", line 474, in adb_command
ERROR output, _ = check_output(full_command, timeout, shell=True)
ERROR File "/home/user/src/lisa/libs/devlib/devlib/utils/misc.py", line 198, in check_output
ERROR raise subprocess.CalledProcessError(retcode, command, output='\n'.join([output or '', error or '']))
ERROR
ERROR Command 'adb -s HIKEY960_1 logcat -c' returned non-zero exit status 1
ERROR OUTPUT:
ERROR
ERROR failed to clear the 'system' log
ERROR
ERROR
Applying this change fixes it for me
diff --git a/wa/workloads/jankbench/__init__.py b/wa/workloads/jankbench/__init__.py
index d4c60cef..1238533b 100644
--- a/wa/workloads/jankbench/__init__.py
+++ b/wa/workloads/jankbench/__init__.py
@@ -208,13 +208,13 @@ class JankbenchRunMonitor(threading.Thread):
self.daemon = True
self.run_ended = threading.Event()
self.stop_event = threading.Event()
- self.target.clear_logcat()
if self.target.adb_name:
self.command = ['adb', '-s', self.target.adb_name, 'logcat']
else:
self.command = ['adb', 'logcat']
def run(self):
+ self.target.clear_logcat()
proc = subprocess.Popen(self.command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
while not self.stop_event.is_set():
if self.run_ended.is_set():
Is there some python rule about starting processes and waiting for it in __init__()
context? I can push a PR with the fix if it's the right thing to do.
Hi. There are no special rules as far as I'm aware of, however I'm not sure how this should be affecting the board this way. Would you be able to send me the run logs from before and after your fix to see what is happening?
Sure. My 'fix' could be a workaround then but it reliably fixes the issue for me and my colleague too.
AFAICT the 'failed to clear 'system' log' message is produced by adb itself
https://github.com/aosp-mirror/platform_system_core/blob/master/logcat/logcat.cpp#L1496
Why logcat fails to clear 'system' (and at other times 'main') logs sometimes when called from __init__()
is werid. I am on Android Q by the way - and did try to ensure that my local adb version matches the server on the device (v40) but that didn't help.
When the failure happens I see this interesting snippet:
908 2018-10-04 11:44:38,175 DEBUG android: adb -s HIKEY960_1 shell am kill-all
909 2018-10-04 11:44:38,251 DEBUG android: adb -s HIKEY960_1 logcat -c
910 2018-10-04 11:44:48,299 DEBUG android: adb -s HIKEY960_1 shell dumpsys power
911 2018-10-04 11:44:48,352 DEBUG android: adb -s HIKEY960_1 logcat -c
912 2018-10-04 11:44:51,431 DEBUG signal: Sending error-logged from <wa.utils.log.ErrorSignalHandler object at 0x7f89ba974850>
913 2018-10-04 11:44:51,432 DEBUG signal: Disconnecting <bound method Executor._error_signalled_callback of executor> from error-logged(<Sender: Any>)
914 2018-10-04 11:44:51,431 ERROR signal: File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup
915 2018-10-04 11:44:51,431 ERROR signal: self.workload.setup(context)
916 2018-10-04 11:44:51,431 ERROR signal: File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 104, in setup
917 2018-10-04 11:44:51,431 ERROR signal: self.monitor = JankbenchRunMonitor(self.target)
918 2018-10-04 11:44:51,431 ERROR signal: File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 211, in __init__
919 2018-10-04 11:44:51,431 ERROR signal: self.target.clear_logcat()
920 2018-10-04 11:44:51,431 ERROR signal: File "/home/usr/src/lisa/libs/devlib/devlib/target.py", line 1387, in clear_logcat
921 2018-10-04 11:44:51,431 ERROR signal: adb_command(self.adb_name, 'logcat -c', timeout=30)
922 2018-10-04 11:44:51,431 ERROR signal: File "/home/usr/src/lisa/libs/devlib/devlib/utils/android.py", line 474, in adb_command
923 2018-10-04 11:44:51,431 ERROR signal: output, _ = check_output(full_command, timeout, shell=True)
924 2018-10-04 11:44:51,431 ERROR signal: File "/home/usr/src/lisa/libs/devlib/devlib/utils/misc.py", line 198, in check_output
925 2018-10-04 11:44:51,431 ERROR signal: raise subprocess.CalledProcessError(retcode, command, output='\n'.join([output or '', error or '']))
926 2018-10-04 11:44:51,431 ERROR signal:
927 2018-10-04 11:44:51,432 DEBUG signal: Sending error-logged from <wa.utils.log.ErrorSignalHandler object at 0x7f89ba974850>
928 2018-10-04 11:44:51,433 DEBUG signal: Disconnecting <bound method Executor._error_signalled_callback of executor> from error-logged(<Sender: Any>)
929 2018-10-04 11:44:51,432 ERROR signal: Command 'adb -s HIKEY960_1 logcat -c' returned non-zero exit status 1
930 2018-10-04 11:44:51,432 ERROR signal: OUTPUT:
931 2018-10-04 11:44:51,432 ERROR signal:
932 2018-10-04 11:44:51,432 ERROR signal: failed to clear the 'system' log
933 2018-10-04 11:44:51,432 ERROR signal:
934 2018-10-04 11:44:51,432 ERROR signal:
Using my workaround
466 2018-10-04 10:33:08,681 DEBUG android: adb -s HIKEY960_1 shell am kill-all
467 2018-10-04 10:33:08,785 DEBUG android: adb -s HIKEY960_1 logcat -c
468 2018-10-04 10:33:18,837 DEBUG android: adb -s HIKEY960_1 shell dumpsys power
469 2018-10-04 10:33:18,900 DEBUG signal: Sending successful-workload-setup from wk1 (jankbench) [1]
470 2018-10-04 10:33:18,900 DEBUG android: adb -s HIKEY960_1 shell "ls /data/local/tmp > /dev/null"
471 2018-10-04 10:33:18,900 DEBUG signal: Sending after-workload-setup from wk1 (jankbench) [1]
472 2018-10-04 10:33:18,901 INFO job: Running job wk1 (jankbench) [1]
473 2018-10-04 10:33:18,904 DEBUG signal: Sending before-workload-execution from wk1 (jankbench) [1]
474 2018-10-04 10:33:18,905 DEBUG android: adb -s HIKEY960_1 shell am start -W -S -n com.android.benchmark/.app.RunLocalBenchmarksActivity --ei com.android.benchmark.EXTRA_RUN_COUNT 1
475 2018-10-04 10:33:18,967 DEBUG android: adb -s HIKEY960_1 shell '(ls -1); echo "
476 2018-10-04 10:33:18,967 DEBUG android: $?"'
477 2018-10-04 10:33:19,041 DEBUG android: ls command is set to ls -1
478 2018-10-04 10:33:19,042 DEBUG android: adb -s HIKEY960_1 logcat -c
479 2018-10-04 10:33:20,297 DEBUG jankbench: Stopping: com.android.benchmark
480 2018-10-04 10:33:20,297 DEBUG jankbench: Starting: Intent { cmp=com.android.benchmark/.app.RunLocalBenchmarksActivity (has extras) }
481 2018-10-04 10:33:20,297 DEBUG jankbench: Status: ok
482 2018-10-04 10:33:20,297 DEBUG jankbench: Activity: com.android.benchmark/.app.RunLocalBenchmarksActivity
483 2018-10-04 10:33:20,297 DEBUG jankbench: ThisTime: 853
484 2018-10-04 10:33:20,297 DEBUG jankbench: TotalTime: 853
485 2018-10-04 10:33:20,297 DEBUG jankbench: WaitTime: 1053
486 2018-10-04 10:33:20,297 DEBUG jankbench: Complete
487 2018-10-04 10:33:20,297 DEBUG jankbench:
The interesting bit is the 2 logcat -c
sandwiching dumpsys power
. My workaround moves the 2nd logcat -c
to happen later. I'll try removing the logcat -c altogether to see what happens.
What does dumpsys powe
do? I tried running this yesterday and spamming adb with logcat -c
didn't generate any failure in over 1500 runs
i=0; while true; do i=$((i+1)); echo "$i"; adb logcat -c; sleep 1; if [ $? -ne 0 ]; then echo "***** FAILED *****"; break; fi; done
I tried the above with and without the sleep 1
by the way and both were happy. So maybe dumpsys power has a side effect.
I can email you the full run.log files if you want.
Removing self.target.clear_logcat()
from __init__()
seems to make the problem disappear too. I'll continue running with that to see if the problem completely disappears
I am on Android Q by the way - and did try to ensure that my local adb version matches the server on the device (v40) but that didn't help.
Thank you for providing the logs, I'm now using the same android and adb versions as specified and I've managed to encounter a similar error. I can however also reproduce it by calling the clear_logcat
method by itself so not sure if this is the exact same issue.
What does dumpsys power do? ... I tried the above with and without the sleep 1 by the way and both were happy. So maybe dumpsys power has a side effect.
This is what we use to help determine the state of the device e.g. the screen state etc. However I don't think it should interact with logcat and I've tried running the two commands in a loop and I haven't managed to find a link so far. If you try adding the extra command into your testing loop do you see the same behaviour?
Removing self.target.clear_logcat() from init() seems to make the problem disappear too. I'll continue running with that to see if the problem completely disappears
I think it might be worthwhile removing the call from the logcat poller as the 2 calls happen so close to each other, however I think this would still be more of a workaround rather than a fix for the underlying issue.
There's something fishy and it's worthwhile trying to understand it I think. With the call to self.targer.clear_logc()
commented out from __init__()
I hit a similar error but this time when running exoplayer workload (first time I hit it btw)
INFO Running job default-exoplayer_ogg_128kbps_30s
INFO Configuring augmentations
INFO Configuring target for job default-exoplayer_ogg_128kbps_30s (exoplayer) [2]
INFO Setting up job default-exoplayer_ogg_128kbps_30s (exoplayer) [2]
ERROR File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup
ERROR self.workload.setup(context)
ERROR File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/exoplayer/__init__.py", line 185, in setup
ERROR self.monitor.start()
ERROR File "/home/usr/src/lisa/libs/devlib/devlib/utils/android.py", line 601, in start
ERROR self.target.clear_logcat()
ERROR File "/home/usr/src/lisa/libs/devlib/devlib/target.py", line 1387, in clear_logcat
ERROR adb_command(self.adb_name, 'logcat -c', timeout=30)
ERROR File "/home/usr/src/lisa/libs/devlib/devlib/utils/android.py", line 474, in adb_command
ERROR output, _ = check_output(full_command, timeout, shell=True)
ERROR File "/home/usr/src/lisa/libs/devlib/devlib/utils/misc.py", line 198, in check_output
ERROR raise subprocess.CalledProcessError(retcode, command, output='\n'.join([output or '', error or '']))
ERROR
ERROR Command 'adb -s HIKEY960_1 logcat -c' returned non-zero exit status 1
ERROR OUTPUT:
ERROR
ERROR failed to clear the 'main' log
ERROR
ERROR
INFO Recording screen state...
INFO Completing job default-exoplayer_ogg_128kbps_30s
ERROR Job default-exoplayer_ogg_128kbps_30s iteration 2 completed with status FAILED. Max retries exceeded.
From run.log
2364 2018-10-04 13:44:03,941 DEBUG android: adb -s HIKEY960_1 shell am kill-all
2365 2018-10-04 13:44:04,024 DEBUG android: adb -s HIKEY960_1 logcat -c
2366 2018-10-04 13:44:14,083 DEBUG android: adb -s HIKEY960_1 shell dumpsys package com.google.android.exoplayer2.demo
2367 2018-10-04 13:44:14,147 DEBUG android: adb -s HIKEY960_1 shell pm grant com.google.android.exoplayer2.demo android.permission.INTERNET
2368 2018-10-04 13:44:14,214 DEBUG android: Cannot grant android.permission.INTERNET
2369 2018-10-04 13:44:14,215 DEBUG android: adb -s HIKEY960_1 shell pm grant com.google.android.exoplayer2.demo android.permission.READ_EXTERNAL_STORAGE
2370 2018-10-04 13:44:14,278 DEBUG android: adb -s HIKEY960_1 shell echo 'if [ -e '\''/sdcard/devlib-target/128kbps_Tchaikovsky_-_Romeo_and_Juliet_Ouverture_-_Antal_Dorati_(1959).ogg'\'' ]; th en echo 1; else echo 0; fi' | su
2371 2018-10-04 13:44:14,352 DEBUG android: adb -s HIKEY960_1 shell settings get system user_rotation
2372 2018-10-04 13:44:14,415 DEBUG android: adb -s HIKEY960_1 shell settings put system accelerometer_rotation 0
2373 2018-10-04 13:44:14,478 DEBUG android: adb -s HIKEY960_1 shell settings put system user_rotation 0
2374 2018-10-04 13:44:14,635 DEBUG android: adb -s HIKEY960_1 logcat -c
2375 2018-10-04 13:44:17,730 DEBUG signal: Sending error-logged from <wa.utils.log.ErrorSignalHandler object at 0x7f29ff39e850>
2376 2018-10-04 13:44:17,731 DEBUG signal: Disconnecting <bound method Executor._error_signalled_callback of executor> from error-logged(<Sender: Any>)
2377 2018-10-04 13:44:17,730 ERROR signal: File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup
2378 2018-10-04 13:44:17,730 ERROR signal: self.workload.setup(context)
2379 2018-10-04 13:44:17,730 ERROR signal: File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/exoplayer/__init__.py", line 185, in setup
2380 2018-10-04 13:44:17,730 ERROR signal: self.monitor.start()
There are 2 calls to logcat -c
but they aren't close enough :-/ I am starting to doubt a bug in adb v40..
If you try adding the extra command into your testing loop do you see the same behaviour?
I'll try and let you know
There are 2 calls to logcat -c but they aren't close enough :-/ I am starting to doubt a bug in adb v40..
It does look like it might be v40. I've tried rolling back my adb version and do not appear to have the same issue any more, update back to v40 and it's back.
If you try adding the extra command into your testing loop do you see the same behaviour?
Doing adb logcat -c; adb shell dumpsys power; adb logcat -c;
in a loop didn't generate any errors for me. So something about the way environment from which these commands run is creating a problem.
It does look like it might be v40. I've tried rolling back my adb version and do not appear to have the same issue any more, update back to v40 and it's back.
Are you referring to adbd running on the device or adb on the host machine? I usually run with adb v36 on my host and it fails; tried v40 on the host and didn't help. Didn't try downgrading adbd though; if that's what you did, how did you do that so I can give it a go too?
Doing
adb logcat -c; adb shell dumpsys power; adb logcat -c;
in a loop didn't generate any errors for me. So something about the way environment from which these commands run is creating a problem.
Ah ok it does appear that way, although I have not manage to find a reliable reproducer as of yet.
Are you referring to adbd running on the device or adb on the host machine? I usually run with adb v36 on my host and it fails; tried v40 on the host and didn't help. Didn't try downgrading adbd though; if that's what you did, how did you do that so I can give it a go too?
That's interesting, sorry this was just my host version I tried changing that seem to have that effect, I have not tried to change my from running adbd (v39) either.
I test with this workload and I see it at least once every run
workloads:
- name: jankbench
iterations: 6
I test with this workload and I see it at least once every run
workloads:
- name: jankbench
iterations: 6
Ok that does seem to produce the error at least once but within that I haven't found a pattern of when it will occur. I guess it must be some sort of race condition somewhere although I'm not sure what could be causing that. Have you spotted any correlations?
No. I tried disabling the logact polling thread but this didn't seem to help. It might be the threading library causing the weird side effect. I'll keep looking as a lower priority background investigation; in the meantime I'll use my workaround to carry on with my work.
By the way I noticed that uiautomation base does call logcat as well - that might explain the exoplayer failure as python and uiautomation 'might' be doing some logcat operation simultaneously. But I think this is unlikely in the normal flow of operation. I don't know.
I've tried moving the calls to clear logcat into the setup method of the workload and this still appears to give the same problem so it doesn't appear to be the threading library, rather just the multiple calls themselves which I shall continue to investigate.
I've taken a look through the exoplayer workload and this also does logcat polling in the background so would make sense that your seeing the same issue there, although that appears to be done via devlib as well so should still be semi protected from concurrent operations.
Interestingly when running the jankbench workload; running ps -ax | grep adb
returns interesting results
19549 ? Ssl 0:01 adb -L tcp:5037 fork-server server --reply-fd 5
22253 pts/5 S+ 0:00 adb -s HIKEY960_1 logcat
24356 pts/5 S+ 0:00 adb -s HIKEY960_1 logcat
26498 pts/5 S+ 0:00 adb -s HIKEY960_1 logcat
27040 pts/18 S+ 0:00 grep --color=auto adb
Note that was while the test was in the 5th iteration. Iteration 2 failed with the usual logact -c error. At the time there was only one 'ghost' adb logcat
process running.
It looks like something that interacts with logcat doesn't terminate correctly between runs and potentially might be causing the occasional logcat -c errors.
I couldn't figure out who spawns this logcat process yet.
May be you can display the parent pid ?
This doesn't help. The parent is the command I've invoked to start the wa test. Logcat commands are called from devlib, workload-automation and java. I can't tell which with one of them that spawns this.
I doubt it's java since jankbench doesn't have a uiautomation module, so unless there's a generic java uiautomation that gets to run automatically, it has to be one of the python modules; I think.
I'm seeing similar duplicated processes on my end so will track down what is spawning each one. We do not have a generic java automation so it will be coming from python somewhere.
I have also found this issue https://github.com/google/mobly/pull/501 which indicates that the failure to clear logcat is a known android bug at least on Android O so the duplicated processes may be making hitting this bug more frequent rather than the root cause.
Silly me. Jankbench run() method executes adb logcat and doesn't do a proc.kill() once it's done. Let me try that to see if it hopefully solves the problem
Preliminary results of this change look promising
diff --git a/wa/workloads/jankbench/__init__.py b/wa/workloads/jankbench/__init__.py
index d4c60cef..5ccc6214 100644
--- a/wa/workloads/jankbench/__init__.py
+++ b/wa/workloads/jankbench/__init__.py
@@ -228,6 +228,8 @@ class JankbenchRunMonitor(threading.Thread):
if self.regex.search(line):
self.run_ended.set()
+ proc.terminate()
+
def stop(self):
self.stop_event.set()
self.join()
I'll test with this for the rest of the day and see if I encounter more problems.
Worth noting that I have seen another problem that might be related to this where jankbench tests 'hangs' in the middle of a long test suite run that includes multiple workloads and a lot of iterations. I think that's when I noticed that there were too many adb processes in my system - but sadly didn't investigate that instance properly. I'm hoping it's another side effect of this problem - let's see.
OK I had a full run of the test suite I was running from LISA without any failures. Are you happy with this as a fix? I can create a PR if you like or feel free to push yours - I don't mind.
Ah excellent, its looks like its solved the issue on my end as well so yes I'm happy to push the fix for that (https://github.com/ARM-software/workload-automation/pull/914).
Have you been having any more problems with exoplayer?
No everything was happy as the test suite runs jankbench, exoplayer, pcmark and geekbench for many iterations and they all ran to completion as far as I see.
FWIW, here's my commit message in case it helps to document the issues I've encountered
https://github.com/qais-yousef/workload-automation/commit/6c1b8ff4d948aa89b36c4efc33842257f1af8dbd
Ok, thank you very much for your help debugging, please me know if you run into any more issues.
Hi,
not sure this is related to WA or if it is how logcat is called but if I run an agenda with multiple iterations, sometimes the test fails at the end because of:
ERROR Got exit code 1 ERROR from target command: logcat -c ERROR OUTPUT: failed to clear the 'main' log