ARM-software / workload-automation

A framework for automating workload execution and measurement collection on ARM devices.
Apache License 2.0
151 stars 120 forks source link

logcat -c fails randomly with multiple iterations #898

Closed dlezcano closed 6 years ago

dlezcano commented 6 years ago

Hi,

not sure this is related to WA or if it is how logcat is called but if I run an agenda with multiple iterations, sometimes the test fails at the end because of:

ERROR Got exit code 1 ERROR from target command: logcat -c ERROR OUTPUT: failed to clear the 'main' log

marcbonnici commented 6 years ago

Hi,

I have not come across that issue before, do you get that same error if you run that command multiple times yourself? / Do you get this with multiple devices?

dlezcano commented 6 years ago

Yes, I get the issue if I run the command multiple times. Yes, this issue has been reported with another device but same platform (hikey960)

dlezcano commented 6 years ago

Perhaps, this is related to the error I can see in the console: [ 1192.961243] binder_alloc: 13020: binder_alloc_buf, no vma [ 1192.961243] binder_alloc: 13020: binder_alloc_buf, no vma [ 1192.972464] binder: 2724:10218 transaction failed 29189/-3, size 108-0 line 3135 [ 1192.972464] binder: 2724:10218 transaction failed 29189/-3, size 108-0 line 3135 [ 1193.014989] binder: undelivered transaction 338114, process died. [ 1193.014989] binder: undelivered transaction 338114, process died. [ 1193.031581] binder: undelivered transaction 338093, process died. [ 1193.031581] binder: undelivered transaction 338093, process died.

marcbonnici commented 6 years ago

Is that with the the same image on the different devices? The other crashes in the console likely indicate that there is something else wrong with the Android system rather than a WA issue itself.

dlezcano commented 6 years ago

Yes, it is with the same image. Agree it must be an Android related issue. Closing the issue.

dlezcano commented 6 years ago

It seems like this error can happen when there are multiple instances of logcat running at the same time. I'm wondering if WA could have miss a 'wait' after exec'ing the adb logcat -c (or may be adb is closing the connection before waiting the child process?). I'm reopening this issue in order to double check if that can be the case from the WA side.

marcbonnici commented 6 years ago

The actual clearing of logcat on the device is performed via devlib (https://github.com/ARM-software/devlib/blob/master/devlib/target.py#L1386) and the implementation there uses a lock to prevent multiple instances and will wait until the command is completed to check the error code so I think these cases should already be guarded against.

When you say the error occurs with multiple instances of logcat running, do you still experience it if you are not also monitoring logcat yourself?

dlezcano commented 6 years ago

Sorry I missed your answer. I reproduced the issue and monitored a 'logcat' instance. There is a: shell 28556 24458 0 08:36:59 pts/1 00:00:00 logcat

stuck on the 'recvmsg' syscall on a socket.

I verified before running the scenario no instance was running. So somehow, the logcat is blocked on the socket and WA3 does not wait it (or does not return an error if a timeout occurs leading the error for the next run).

dlezcano commented 6 years ago

Agenda triggering the issue:

sections:

workloads:

marcbonnici commented 6 years ago

Thank you very much for your agenda, I managed to reproduce the issue here and the jankbench monitor was clearing logcat itself rather than using the method available in devlib.

Would you be able to try my fix (https://github.com/ARM-software/workload-automation/pull/900) and see if this fixes your issue?

dlezcano commented 6 years ago

Yes, I confirm I do no longer see the error now. Thanks !

marcbonnici commented 6 years ago

Cool thank you, I've merged the fix into master.

qais-yousef commented 6 years ago

Hi. I do see this problem a lot too. I did take the latest fixes but logcat fails with this error message more than 1/3rd of the time

ERROR              File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup
ERROR                self.workload.setup(context)
ERROR              File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 104, in setup
ERROR                self.monitor = JankbenchRunMonitor(self.target)
ERROR              File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 211, in __init__
ERROR                self.target.clear_logcat()
ERROR              File "/home/user/src/lisa/libs/devlib/devlib/target.py", line 1387, in clear_logcat
ERROR                adb_command(self.adb_name, 'logcat -c', timeout=30)
ERROR              File "/home/user/src/lisa/libs/devlib/devlib/utils/android.py", line 474, in adb_command
ERROR                output, _ = check_output(full_command, timeout, shell=True)
ERROR              File "/home/user/src/lisa/libs/devlib/devlib/utils/misc.py", line 198, in check_output
ERROR                raise subprocess.CalledProcessError(retcode, command, output='\n'.join([output or '', error or '']))
ERROR    
ERROR            Command 'adb -s HIKEY960_1 logcat -c' returned non-zero exit status 1
ERROR            OUTPUT:
ERROR    
ERROR            failed to clear the 'system' log
ERROR    
ERROR    

Applying this change fixes it for me

diff --git a/wa/workloads/jankbench/__init__.py b/wa/workloads/jankbench/__init__.py
index d4c60cef..1238533b 100644
--- a/wa/workloads/jankbench/__init__.py
+++ b/wa/workloads/jankbench/__init__.py
@@ -208,13 +208,13 @@ class JankbenchRunMonitor(threading.Thread):
         self.daemon = True
         self.run_ended = threading.Event()
         self.stop_event = threading.Event()
-        self.target.clear_logcat()
         if self.target.adb_name:
             self.command = ['adb', '-s', self.target.adb_name, 'logcat']
         else:
             self.command = ['adb', 'logcat']

     def run(self):
+        self.target.clear_logcat()
         proc = subprocess.Popen(self.command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
         while not self.stop_event.is_set():
             if self.run_ended.is_set():

Is there some python rule about starting processes and waiting for it in __init__() context? I can push a PR with the fix if it's the right thing to do.

marcbonnici commented 6 years ago

Hi. There are no special rules as far as I'm aware of, however I'm not sure how this should be affecting the board this way. Would you be able to send me the run logs from before and after your fix to see what is happening?

qais-yousef commented 6 years ago

Sure. My 'fix' could be a workaround then but it reliably fixes the issue for me and my colleague too.

AFAICT the 'failed to clear 'system' log' message is produced by adb itself

https://github.com/aosp-mirror/platform_system_core/blob/master/logcat/logcat.cpp#L1496

Why logcat fails to clear 'system' (and at other times 'main') logs sometimes when called from __init__() is werid. I am on Android Q by the way - and did try to ensure that my local adb version matches the server on the device (v40) but that didn't help.

qais-yousef commented 6 years ago

When the failure happens I see this interesting snippet:

908 2018-10-04 11:44:38,175 DEBUG       android:         adb -s HIKEY960_1 shell am kill-all                                                                                                                 
 909 2018-10-04 11:44:38,251 DEBUG       android:         adb -s HIKEY960_1 logcat -c                                                                                                                         
 910 2018-10-04 11:44:48,299 DEBUG       android:         adb -s HIKEY960_1 shell dumpsys power                                                                                                               
 911 2018-10-04 11:44:48,352 DEBUG       android:         adb -s HIKEY960_1 logcat -c                                                                                                                         
 912 2018-10-04 11:44:51,431 DEBUG        signal:         Sending error-logged from <wa.utils.log.ErrorSignalHandler object at 0x7f89ba974850>                                                                
 913 2018-10-04 11:44:51,432 DEBUG        signal:         Disconnecting <bound method Executor._error_signalled_callback of executor> from error-logged(<Sender: Any>)                                        
 914 2018-10-04 11:44:51,431 ERROR        signal:           File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup                                                       
 915 2018-10-04 11:44:51,431 ERROR        signal:             self.workload.setup(context)                                                                                                                    
 916 2018-10-04 11:44:51,431 ERROR        signal:           File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 104, in setup                                        
 917 2018-10-04 11:44:51,431 ERROR        signal:             self.monitor = JankbenchRunMonitor(self.target)                                                                                                 
 918 2018-10-04 11:44:51,431 ERROR        signal:           File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/jankbench/__init__.py", line 211, in __init__                                     
 919 2018-10-04 11:44:51,431 ERROR        signal:             self.target.clear_logcat()                                                                                                                      
 920 2018-10-04 11:44:51,431 ERROR        signal:           File "/home/usr/src/lisa/libs/devlib/devlib/target.py", line 1387, in clear_logcat                                                            
 921 2018-10-04 11:44:51,431 ERROR        signal:             adb_command(self.adb_name, 'logcat -c', timeout=30)                                                                                             
 922 2018-10-04 11:44:51,431 ERROR        signal:           File "/home/usr/src/lisa/libs/devlib/devlib/utils/android.py", line 474, in adb_command                                                       
 923 2018-10-04 11:44:51,431 ERROR        signal:             output, _ = check_output(full_command, timeout, shell=True)                                                                                     
 924 2018-10-04 11:44:51,431 ERROR        signal:           File "/home/usr/src/lisa/libs/devlib/devlib/utils/misc.py", line 198, in check_output                                                         
 925 2018-10-04 11:44:51,431 ERROR        signal:             raise subprocess.CalledProcessError(retcode, command, output='\n'.join([output or '', error or '']))                                            
 926 2018-10-04 11:44:51,431 ERROR        signal:                                                                                                                                                             
 927 2018-10-04 11:44:51,432 DEBUG        signal:         Sending error-logged from <wa.utils.log.ErrorSignalHandler object at 0x7f89ba974850>                                                                
 928 2018-10-04 11:44:51,433 DEBUG        signal:         Disconnecting <bound method Executor._error_signalled_callback of executor> from error-logged(<Sender: Any>)                                        
 929 2018-10-04 11:44:51,432 ERROR        signal:         Command 'adb -s HIKEY960_1 logcat -c' returned non-zero exit status 1                                                                               
 930 2018-10-04 11:44:51,432 ERROR        signal:         OUTPUT:                                                                                                                                             
 931 2018-10-04 11:44:51,432 ERROR        signal:                                                                                                                                                             
 932 2018-10-04 11:44:51,432 ERROR        signal:         failed to clear the 'system' log                                                                                                                    
 933 2018-10-04 11:44:51,432 ERROR        signal:                                                                                                                                                             
 934 2018-10-04 11:44:51,432 ERROR        signal:     

Using my workaround

466 2018-10-04 10:33:08,681 DEBUG       android:         adb -s HIKEY960_1 shell am kill-all                                                                                                                 
 467 2018-10-04 10:33:08,785 DEBUG       android:         adb -s HIKEY960_1 logcat -c                                                                                                                         
 468 2018-10-04 10:33:18,837 DEBUG       android:         adb -s HIKEY960_1 shell dumpsys power                                                                                                               
 469 2018-10-04 10:33:18,900 DEBUG        signal:         Sending successful-workload-setup from wk1 (jankbench) [1]                                                                                          
 470 2018-10-04 10:33:18,900 DEBUG       android:         adb -s HIKEY960_1 shell "ls /data/local/tmp > /dev/null"                                                                                            
 471 2018-10-04 10:33:18,900 DEBUG        signal:         Sending after-workload-setup from wk1 (jankbench) [1]                                                                                               
 472 2018-10-04 10:33:18,901 INFO            job:     Running job wk1 (jankbench) [1]                                                                                                                         
 473 2018-10-04 10:33:18,904 DEBUG        signal:         Sending before-workload-execution from wk1 (jankbench) [1]                                                                                          
 474 2018-10-04 10:33:18,905 DEBUG       android:         adb -s HIKEY960_1 shell am start -W -S -n com.android.benchmark/.app.RunLocalBenchmarksActivity --ei com.android.benchmark.EXTRA_RUN_COUNT 1        
 475 2018-10-04 10:33:18,967 DEBUG       android:         adb -s HIKEY960_1 shell '(ls -1); echo "                                                                                                            
 476 2018-10-04 10:33:18,967 DEBUG       android:         $?"'                                                                                                                                                
 477 2018-10-04 10:33:19,041 DEBUG       android:         ls command is set to ls -1                                                                                                                          
 478 2018-10-04 10:33:19,042 DEBUG       android:         adb -s HIKEY960_1 logcat -c                                                                                                                         
 479 2018-10-04 10:33:20,297 DEBUG     jankbench:         Stopping: com.android.benchmark                                                                                                                     
 480 2018-10-04 10:33:20,297 DEBUG     jankbench:         Starting: Intent { cmp=com.android.benchmark/.app.RunLocalBenchmarksActivity (has extras) }                                                         
 481 2018-10-04 10:33:20,297 DEBUG     jankbench:         Status: ok                                                                                                                                          
 482 2018-10-04 10:33:20,297 DEBUG     jankbench:         Activity: com.android.benchmark/.app.RunLocalBenchmarksActivity                                                                                     
 483 2018-10-04 10:33:20,297 DEBUG     jankbench:         ThisTime: 853                                                                                                                                       
 484 2018-10-04 10:33:20,297 DEBUG     jankbench:         TotalTime: 853                                                                                                                                      
 485 2018-10-04 10:33:20,297 DEBUG     jankbench:         WaitTime: 1053                                                                                                                                      
 486 2018-10-04 10:33:20,297 DEBUG     jankbench:         Complete                                                                                                                                            
 487 2018-10-04 10:33:20,297 DEBUG     jankbench:                                              

The interesting bit is the 2 logcat -c sandwiching dumpsys power. My workaround moves the 2nd logcat -c to happen later. I'll try removing the logcat -c altogether to see what happens.

What does dumpsys powe do? I tried running this yesterday and spamming adb with logcat -c didn't generate any failure in over 1500 runs

i=0; while true; do i=$((i+1)); echo "$i"; adb logcat -c; sleep 1; if [ $? -ne 0 ]; then echo "***** FAILED *****"; break; fi; done

I tried the above with and without the sleep 1 by the way and both were happy. So maybe dumpsys power has a side effect.

I can email you the full run.log files if you want.

qais-yousef commented 6 years ago

Removing self.target.clear_logcat() from __init__() seems to make the problem disappear too. I'll continue running with that to see if the problem completely disappears

marcbonnici commented 6 years ago

I am on Android Q by the way - and did try to ensure that my local adb version matches the server on the device (v40) but that didn't help.

Thank you for providing the logs, I'm now using the same android and adb versions as specified and I've managed to encounter a similar error. I can however also reproduce it by calling the clear_logcat method by itself so not sure if this is the exact same issue.

What does dumpsys power do? ... I tried the above with and without the sleep 1 by the way and both were happy. So maybe dumpsys power has a side effect.

This is what we use to help determine the state of the device e.g. the screen state etc. However I don't think it should interact with logcat and I've tried running the two commands in a loop and I haven't managed to find a link so far. If you try adding the extra command into your testing loop do you see the same behaviour?

Removing self.target.clear_logcat() from init() seems to make the problem disappear too. I'll continue running with that to see if the problem completely disappears

I think it might be worthwhile removing the call from the logcat poller as the 2 calls happen so close to each other, however I think this would still be more of a workaround rather than a fix for the underlying issue.

qais-yousef commented 6 years ago

There's something fishy and it's worthwhile trying to understand it I think. With the call to self.targer.clear_logc() commented out from __init__() I hit a similar error but this time when running exoplayer workload (first time I hit it btw)

INFO     Running job default-exoplayer_ogg_128kbps_30s
INFO         Configuring augmentations
INFO         Configuring target for job default-exoplayer_ogg_128kbps_30s (exoplayer) [2]
INFO         Setting up job default-exoplayer_ogg_128kbps_30s (exoplayer) [2]
ERROR              File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup
ERROR                self.workload.setup(context)
ERROR              File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/exoplayer/__init__.py", line 185, in setup
ERROR                self.monitor.start()
ERROR              File "/home/usr/src/lisa/libs/devlib/devlib/utils/android.py", line 601, in start
ERROR                self.target.clear_logcat()
ERROR              File "/home/usr/src/lisa/libs/devlib/devlib/target.py", line 1387, in clear_logcat
ERROR                adb_command(self.adb_name, 'logcat -c', timeout=30)
ERROR              File "/home/usr/src/lisa/libs/devlib/devlib/utils/android.py", line 474, in adb_command
ERROR                output, _ = check_output(full_command, timeout, shell=True)
ERROR              File "/home/usr/src/lisa/libs/devlib/devlib/utils/misc.py", line 198, in check_output
ERROR                raise subprocess.CalledProcessError(retcode, command, output='\n'.join([output or '', error or '']))
ERROR    
ERROR            Command 'adb -s HIKEY960_1 logcat -c' returned non-zero exit status 1
ERROR            OUTPUT:
ERROR    
ERROR            failed to clear the 'main' log
ERROR    
ERROR    
INFO         Recording screen state...
INFO         Completing job default-exoplayer_ogg_128kbps_30s
ERROR    Job default-exoplayer_ogg_128kbps_30s iteration 2 completed with status FAILED. Max retries exceeded.

From run.log

2364 2018-10-04 13:44:03,941 DEBUG       android:         adb -s HIKEY960_1 shell am kill-all                                                                                                                 
2365 2018-10-04 13:44:04,024 DEBUG       android:         adb -s HIKEY960_1 logcat -c                                                                                                                         
2366 2018-10-04 13:44:14,083 DEBUG       android:         adb -s HIKEY960_1 shell dumpsys package com.google.android.exoplayer2.demo                                                                          
2367 2018-10-04 13:44:14,147 DEBUG       android:         adb -s HIKEY960_1 shell pm grant com.google.android.exoplayer2.demo android.permission.INTERNET                                                     
2368 2018-10-04 13:44:14,214 DEBUG       android:         Cannot grant android.permission.INTERNET                                                                                                            
2369 2018-10-04 13:44:14,215 DEBUG       android:         adb -s HIKEY960_1 shell pm grant com.google.android.exoplayer2.demo android.permission.READ_EXTERNAL_STORAGE                                        
2370 2018-10-04 13:44:14,278 DEBUG       android:         adb -s HIKEY960_1 shell echo 'if [ -e '\''/sdcard/devlib-target/128kbps_Tchaikovsky_-_Romeo_and_Juliet_Ouverture_-_Antal_Dorati_(1959).ogg'\'' ]; th     en echo 1; else echo 0; fi' | su                                                                                                                                                                         
2371 2018-10-04 13:44:14,352 DEBUG       android:         adb -s HIKEY960_1 shell settings get system user_rotation                                                                                           
2372 2018-10-04 13:44:14,415 DEBUG       android:         adb -s HIKEY960_1 shell settings put system accelerometer_rotation 0                                                                                
2373 2018-10-04 13:44:14,478 DEBUG       android:         adb -s HIKEY960_1 shell settings put system user_rotation 0                                                                                         
2374 2018-10-04 13:44:14,635 DEBUG       android:         adb -s HIKEY960_1 logcat -c                                                                                                                         
2375 2018-10-04 13:44:17,730 DEBUG        signal:         Sending error-logged from <wa.utils.log.ErrorSignalHandler object at 0x7f29ff39e850>                                                                
2376 2018-10-04 13:44:17,731 DEBUG        signal:         Disconnecting <bound method Executor._error_signalled_callback of executor> from error-logged(<Sender: Any>)                                        
2377 2018-10-04 13:44:17,730 ERROR        signal:           File "/mnt/data/src/lisa/tools/workload-automation/wa/framework/job.py", line 134, in setup                                                       
2378 2018-10-04 13:44:17,730 ERROR        signal:             self.workload.setup(context)                                                                                                                    
2379 2018-10-04 13:44:17,730 ERROR        signal:           File "/mnt/data/src/lisa/tools/workload-automation/wa/workloads/exoplayer/__init__.py", line 185, in setup                                        
2380 2018-10-04 13:44:17,730 ERROR        signal:             self.monitor.start()                  

There are 2 calls to logcat -c but they aren't close enough :-/ I am starting to doubt a bug in adb v40..

If you try adding the extra command into your testing loop do you see the same behaviour?

I'll try and let you know

marcbonnici commented 6 years ago

There are 2 calls to logcat -c but they aren't close enough :-/ I am starting to doubt a bug in adb v40..

It does look like it might be v40. I've tried rolling back my adb version and do not appear to have the same issue any more, update back to v40 and it's back.

qais-yousef commented 6 years ago

If you try adding the extra command into your testing loop do you see the same behaviour?

Doing adb logcat -c; adb shell dumpsys power; adb logcat -c; in a loop didn't generate any errors for me. So something about the way environment from which these commands run is creating a problem.

It does look like it might be v40. I've tried rolling back my adb version and do not appear to have the same issue any more, update back to v40 and it's back.

Are you referring to adbd running on the device or adb on the host machine? I usually run with adb v36 on my host and it fails; tried v40 on the host and didn't help. Didn't try downgrading adbd though; if that's what you did, how did you do that so I can give it a go too?

marcbonnici commented 6 years ago

Doing adb logcat -c; adb shell dumpsys power; adb logcat -c; in a loop didn't generate any errors for me. So something about the way environment from which these commands run is creating a problem.

Ah ok it does appear that way, although I have not manage to find a reliable reproducer as of yet.

Are you referring to adbd running on the device or adb on the host machine? I usually run with adb v36 on my host and it fails; tried v40 on the host and didn't help. Didn't try downgrading adbd though; if that's what you did, how did you do that so I can give it a go too?

That's interesting, sorry this was just my host version I tried changing that seem to have that effect, I have not tried to change my from running adbd (v39) either.

qais-yousef commented 6 years ago

I test with this workload and I see it at least once every run

 workloads:                                                                                                                                                                                                
    - name: jankbench                                                                                                                                                                                       
      iterations: 6                                                                                                                                                                                         
qais-yousef commented 6 years ago

I test with this workload and I see it at least once every run

 workloads:                                                                                                                                                                                                
    - name: jankbench                                                                                                                                                                                       
      iterations: 6                                                                                                                                                                                         
marcbonnici commented 6 years ago

Ok that does seem to produce the error at least once but within that I haven't found a pattern of when it will occur. I guess it must be some sort of race condition somewhere although I'm not sure what could be causing that. Have you spotted any correlations?

qais-yousef commented 6 years ago

No. I tried disabling the logact polling thread but this didn't seem to help. It might be the threading library causing the weird side effect. I'll keep looking as a lower priority background investigation; in the meantime I'll use my workaround to carry on with my work.

By the way I noticed that uiautomation base does call logcat as well - that might explain the exoplayer failure as python and uiautomation 'might' be doing some logcat operation simultaneously. But I think this is unlikely in the normal flow of operation. I don't know.

marcbonnici commented 6 years ago

I've tried moving the calls to clear logcat into the setup method of the workload and this still appears to give the same problem so it doesn't appear to be the threading library, rather just the multiple calls themselves which I shall continue to investigate.

I've taken a look through the exoplayer workload and this also does logcat polling in the background so would make sense that your seeing the same issue there, although that appears to be done via devlib as well so should still be semi protected from concurrent operations.

qais-yousef commented 6 years ago

Interestingly when running the jankbench workload; running ps -ax | grep adb returns interesting results

19549 ?        Ssl    0:01 adb -L tcp:5037 fork-server server --reply-fd 5
22253 pts/5    S+     0:00 adb -s HIKEY960_1 logcat
24356 pts/5    S+     0:00 adb -s HIKEY960_1 logcat
26498 pts/5    S+     0:00 adb -s HIKEY960_1 logcat
27040 pts/18   S+     0:00 grep --color=auto adb

Note that was while the test was in the 5th iteration. Iteration 2 failed with the usual logact -c error. At the time there was only one 'ghost' adb logcat process running.

It looks like something that interacts with logcat doesn't terminate correctly between runs and potentially might be causing the occasional logcat -c errors.

I couldn't figure out who spawns this logcat process yet.

dlezcano commented 6 years ago

May be you can display the parent pid ?

qais-yousef commented 6 years ago

This doesn't help. The parent is the command I've invoked to start the wa test. Logcat commands are called from devlib, workload-automation and java. I can't tell which with one of them that spawns this.

I doubt it's java since jankbench doesn't have a uiautomation module, so unless there's a generic java uiautomation that gets to run automatically, it has to be one of the python modules; I think.

marcbonnici commented 6 years ago

I'm seeing similar duplicated processes on my end so will track down what is spawning each one. We do not have a generic java automation so it will be coming from python somewhere.

I have also found this issue https://github.com/google/mobly/pull/501 which indicates that the failure to clear logcat is a known android bug at least on Android O so the duplicated processes may be making hitting this bug more frequent rather than the root cause.

qais-yousef commented 6 years ago

Silly me. Jankbench run() method executes adb logcat and doesn't do a proc.kill() once it's done. Let me try that to see if it hopefully solves the problem

qais-yousef commented 6 years ago

Preliminary results of this change look promising

diff --git a/wa/workloads/jankbench/__init__.py b/wa/workloads/jankbench/__init__.py
index d4c60cef..5ccc6214 100644
--- a/wa/workloads/jankbench/__init__.py
+++ b/wa/workloads/jankbench/__init__.py
@@ -228,6 +228,8 @@ class JankbenchRunMonitor(threading.Thread):
                     if self.regex.search(line):
                         self.run_ended.set()

+        proc.terminate()
+
     def stop(self):
         self.stop_event.set()
         self.join()

I'll test with this for the rest of the day and see if I encounter more problems.

Worth noting that I have seen another problem that might be related to this where jankbench tests 'hangs' in the middle of a long test suite run that includes multiple workloads and a lot of iterations. I think that's when I noticed that there were too many adb processes in my system - but sadly didn't investigate that instance properly. I'm hoping it's another side effect of this problem - let's see.

qais-yousef commented 6 years ago

OK I had a full run of the test suite I was running from LISA without any failures. Are you happy with this as a fix? I can create a PR if you like or feel free to push yours - I don't mind.

marcbonnici commented 6 years ago

Ah excellent, its looks like its solved the issue on my end as well so yes I'm happy to push the fix for that (https://github.com/ARM-software/workload-automation/pull/914).

Have you been having any more problems with exoplayer?

qais-yousef commented 6 years ago

No everything was happy as the test suite runs jankbench, exoplayer, pcmark and geekbench for many iterations and they all ran to completion as far as I see.

FWIW, here's my commit message in case it helps to document the issues I've encountered

https://github.com/qais-yousef/workload-automation/commit/6c1b8ff4d948aa89b36c4efc33842257f1af8dbd

marcbonnici commented 6 years ago

Ok, thank you very much for your help debugging, please me know if you run into any more issues.