aisingapore / TagUI

Free RPA tool by AI Singapore
Apache License 2.0

Chrome headless hangs - fix to work with new behaviour of newer Chrome versions #890

Closed kensoh closed 3 years ago

kensoh commented 3 years ago

See the simple flow below, run in headless mode -

tagui https://raw.githubusercontent.com/kelaberetiv/TagUI/master/flows/samples/1_google.tag -headless

run output log

START - automation started - Wed Dec 09 2020 15:06:29 GMT+0800 (+08)
https://www.google.com/ - Google

type q as latest movies[enter]

tagui chrome log

[tagui] START  - listening for inputs

[tagui] INPUT  - 
[1] {"id":1,"method":"Page.setDownloadBehavior","params":{"behavior":"allow","downloadPath":"/Users/kensoh/Desktop"}}
[tagui] OUTPUT - 
[1] {"id":1,"result":{}}

[tagui] INPUT  - 
[2] {"id":2,"method":"Page.navigate","params":{"url":"https://www.google.com/"}}
[tagui] OUTPUT - 
[2] {"id":2,"result":{"frameId":"2FDE9605F9AFA624135BFFBC7AD2F0D2","loaderId":"78128FB53CE1DBA80D8DFC2D2B83D7A9"}}

[tagui] INPUT  - 
[3] {"id":3,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
[tagui] OUTPUT - 
[3] {"id":3,"result":{"result":{"type":"string","value":"Google"}}}

[tagui] INPUT  - 
[4] {"id":4,"method":"Runtime.evaluate","params":{"expression":"document.querySelectorAll('q').length"}}
[tagui] OUTPUT - 
[4] {"id":4,"result":{"result":{"type":"number","value":0,"description":"0"}}}

[tagui] INPUT  - 
[5] {"id":5,"method":"Runtime.evaluate","params":{"expression":"document.evaluate('//*[@id=\"q\"]',document,null,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,null).snapshotLength"}}
[tagui] OUTPUT - 
[5] {"id":5,"result":{"result":{"type":"number","value":0,"description":"0"}}}

[tagui] INPUT  - 
[6] {"id":6,"method":"Runtime.evaluate","params":{"expression":"document.evaluate('//*[contains(@id,\"q\")]',document,null,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,null).snapshotLength"}}
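The log above ends with request 6 as an INPUT that has no matching OUTPUT, which is where the hang shows up. A minimal sketch of spotting such requests automatically, assuming the log keeps the `[tagui] INPUT` / `[tagui] OUTPUT` layout shown above (the function name and the sample log are illustrative, not part of TagUI):

```python
import json
import re

# Matches protocol lines such as: [3] {"id":3,"method":"Runtime.evaluate",...}
LINE_RE = re.compile(r'^\[(\d+)\]\s+(\{.*\})\s*$')

def unanswered_requests(log_text):
    """Return the requests in a tagui chrome log that never received a reply."""
    pending = {}
    section = None
    for line in log_text.splitlines():
        if line.startswith('[tagui] INPUT'):
            section = 'input'
        elif line.startswith('[tagui] OUTPUT'):
            section = 'output'
        else:
            match = LINE_RE.match(line)
            if not match or section is None:
                continue
            message = json.loads(match.group(2))
            if section == 'input':
                # Remember the request until a reply with the same id is seen.
                pending[message['id']] = message
            else:
                pending.pop(message.get('id'), None)
    return [pending[key] for key in sorted(pending)]

# Shortened sample in the same layout as the log above.
sample_log = """[tagui] START  - listening for inputs

[tagui] INPUT  - 
[5] {"id":5,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
[tagui] OUTPUT - 
[5] {"id":5,"result":{"result":{"type":"string","value":"Google"}}}

[tagui] INPUT  - 
[6] {"id":6,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
"""

for request in unanswered_requests(sample_log):
    print(request['id'], request['method'])
```

On the sample, only request 6 is reported, since it is the only INPUT without an OUTPUT.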
kensoh commented 3 years ago

Adding on, the same flow in visible Chrome mode runs at normal speed, instead of hanging while waiting a very long time for a reply.

This is likely due to a change in the behaviour of headless Chrome in newer Chrome releases, as it previously worked.

kensoh commented 3 years ago

On debugging the WebSocket connection, the error below keeps appearing non-stop -

WebSocket\ConnectionException: Empty read; connection dead?  Stream state: {"timed_out":true,"blocked":true,"eof":false,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false}
kensoh commented 3 years ago

Running the code below, from the following article, to test whether headless Chrome works -

https://medium.com/@lagenar/using-headless-chrome-via-the-websockets-interface-5f498fb67e0f

import json
import time
import subprocess
import requests
from websocket import create_connection

def start_browser(browser_path, debugging_port):
    options = ['--headless', '--disable-gpu',
               '--remote-debugging-port={}'.format(debugging_port)]
    browser_proc = subprocess.Popen([browser_path] + options)
    wait_seconds = 10.0
    sleep_step = 0.25
    while wait_seconds > 0:
        try:
            url = 'http://127.0.0.1:{}/json'.format(debugging_port)
            resp = requests.get(url).json()
            ws_url = resp[0]['webSocketDebuggerUrl']
            return browser_proc, create_connection(ws_url)
        except requests.exceptions.ConnectionError:
            time.sleep(sleep_step)
            wait_seconds -= sleep_step
    raise Exception('Unable to connect to chrome')

request_id = 0

def run_command(conn, method, **kwargs):
    global request_id
    request_id += 1
    command = {'method': method,
               'id': request_id,
               'params': kwargs}
    conn.send(json.dumps(command))
    while True:
        msg = json.loads(conn.recv())
        if msg.get('id') == request_id:
            return msg

gnews_url = 'https://news.google.com/news/?ned=us&hl=en'
chrome_path = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
browser, conn = start_browser(chrome_path, 9222)
run_command(conn, 'Page.navigate', url=gnews_url)
time.sleep(5) # let it load
js = """
var sel = 'h3 > a';
var headings = document.querySelectorAll(sel);
headings = [].slice.call(headings).map((link)=>{return link.innerText});
JSON.stringify(headings);
"""
result = run_command(conn, 'Runtime.evaluate', expression=js)

headings = json.loads(result['result']['result']['value'])
for heading in headings:
    print(heading)
browser.terminate()
[1211/104953.088401:ERROR:xattr.cc(63)] setxattr org.chromium.crashpad.database.initialized on file /var/folders/4c/21d_62nx5tnf_t9l0fl9jych0000gn/T/: Operation not permitted (1)
[1211/104953.091128:ERROR:file_io.cc(90)] ReadExactly: expected 8, observed 0
[1211/104953.093504:ERROR:xattr.cc(63)] setxattr org.chromium.crashpad.database.initialized on file /var/folders/4c/21d_62nx5tnf_t9l0fl9jych0000gn/T/: Operation not permitted (1)
[1211/104953.205417:ERROR:socket_posix.cc(148)] bind() failed: Address already in use (48)

DevTools listening on ws://[::1]:9222/devtools/browser/f0b4d7d2-2331-4b2a-8e2f-4d60e8e12e0d
With time running out, Trump and GOP allies turn up pressure on Supreme Court in election assault
Second stimulus check updates: McConnell says no GOP support for emerging COVID-19 relief deal
Biden's pick of Denis McDonough for VA sparks pushback from veterans
Hopes dwindle for Northern Lights over parts of the US tonight
Body cam footage shows raid on former Florida Covid data scientist's home
Republican NH House Speaker Dies Of COVID-19
'I literally lost it': Kim Kardashian reacts to Brandon Bernard's scheduled execution, details last phone call
Majority of House GOP support lawsuit aimed at overturning election - Business Insider
Hoped for northern lights in New England a 'big miss,' U.S. space forecaster says
Hillary Clinton says Republicans who 'humor' Trump's election fraud claims 'have no spines'
Inhofe slams Trump administration on Western Sahara policy
Trump administration reportedly sanctioning Turkey over S-400 - Business Insider
Chinese citizen journalist detained for reporting on Wuhan coronavirus outbreak "may not survive"
Spain Evicts Francisco Franco's Heirs From Late Dictator's Summer Palace
kensoh commented 3 years ago

The following simple TagUI script, based on the above, works in normal mode,

https://news.google.com/news/?ned=us&hl=en
wait
dom return document.evaluate('//h3/a',document,null,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,null).snapshotItem(0).innerText
echo `dom_result`

but throws the same error using headless mode -

WebSocket\ConnectionException: Empty read; connection dead?  Stream state: {"timed_out":false,"blocked":true,"eof":true,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false} in /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php:269
Stack trace:
#0 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php(143): WebSocket\Base->read(2)
#1 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php(135): WebSocket\Base->receive_fragment()
#2 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/tagui_chrome.php(48): WebSocket\Base->receive()
#3 {main}
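The stream state in this error differs from the earlier one: here `eof` is true and `timed_out` is false, while the first error had `timed_out` true and `eof` false. A small sketch of telling the two cases apart, assuming the message keeps the `Stream state: {...}` layout shown above (the function name and labels are illustrative):

```python
import json

MARKER = 'Stream state: '

def classify_empty_read(error_text):
    """Label an 'Empty read' error as a timeout, a closed stream, or unknown."""
    start = error_text.find(MARKER)
    if start == -1:
        return 'unknown'
    # raw_decode parses the JSON object and ignores trailing text such as the file path.
    state, _ = json.JSONDecoder().raw_decode(error_text, start + len(MARKER))
    if state.get('eof'):
        return 'closed'    # the stream reached end-of-file
    if state.get('timed_out'):
        return 'timeout'   # no data arrived before the read timed out
    return 'unknown'

# Raw strings keep the \/ escape exactly as it appears in the error text.
first = r'WebSocket\ConnectionException: Empty read; connection dead?  Stream state: {"timed_out":true,"blocked":true,"eof":false,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false}'
second = r'WebSocket\ConnectionException: Empty read; connection dead?  Stream state: {"timed_out":false,"blocked":true,"eof":true,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false} in src/ws/Base.php:269'

print(classify_empty_read(first))
print(classify_empty_read(second))
```

A timeout suggests Chrome is slow to answer, while an end-of-file suggests the connection itself went away, so the two may need different handling.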
kensoh commented 3 years ago

consider the following simple script

https://news.google.com/news/?ned=us&hl=en
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});
chrome_step('Runtime.evaluate',{expression: 'document.title'});

The same simple calls lead to persistent timeouts after the first few succeed -

[tagui] START  - listening for inputs

[tagui] INPUT  - 
[1] {"id":1,"method":"Page.setDownloadBehavior","params":{"behavior":"allow","downloadPath":"/Users/kensoh/Desktop"}}
TEST - {"id":1,"result":{}}

[tagui] OUTPUT - 
[1] {"id":1,"result":{}}

[tagui] INPUT  - 
[2] {"id":2,"method":"Page.navigate","params":{"url":"https://news.google.com/news/?ned=us&hl=en"}}
TEST - {"id":2,"result":{"frameId":"AAD7ABF6A619E2C079F45FAEC16EE19C","loaderId":"10C8D76494D16867EE469A872C4FE419"}}

[tagui] OUTPUT - 
[2] {"id":2,"result":{"frameId":"AAD7ABF6A619E2C079F45FAEC16EE19C","loaderId":"10C8D76494D16867EE469A872C4FE419"}}

[tagui] INPUT  - 
[3] {"id":3,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
TEST - {"id":3,"result":{"result":{"type":"string","value":"Google News"}}}

[tagui] OUTPUT - 
[3] {"id":3,"result":{"result":{"type":"string","value":"Google News"}}}

[tagui] INPUT  - 
[4] {"id":4,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
TEST - {"id":4,"result":{"result":{"type":"string","value":"Google News"}}}

[tagui] OUTPUT - 
[4] {"id":4,"result":{"result":{"type":"string","value":"Google News"}}}

[tagui] INPUT  - 
[5] {"id":5,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
TEST - WebSocket\ConnectionException: Empty read; connection dead?  Stream state: {"timed_out":true,"blocked":true,"eof":false,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false} in /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php:269
Stack trace:
#0 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php(143): WebSocket\Base->read(2)
#1 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php(135): WebSocket\Base->receive_fragment()
#2 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/tagui_chrome.php(48): WebSocket\Base->receive()
#3 {main}

TEST - {"id":5,"result":{"result":{"type":"string","value":"Google News"}}}

[tagui] OUTPUT - 
[5] {"id":5,"result":{"result":{"type":"string","value":"Google News"}}}

[tagui] INPUT  - 
[6] {"id":6,"method":"Runtime.evaluate","params":{"expression":"document.title"}}
TEST - WebSocket\ConnectionException: Empty read; connection dead?  Stream state: {"timed_out":true,"blocked":true,"eof":false,"stream_type":"tcp_socket\/ssl","mode":"r+","unread_bytes":0,"seekable":false} in /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php:269
Stack trace:
#0 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php(143): WebSocket\Base->read(2)
#1 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/ws/Base.php(135): WebSocket\Base->receive_fragment()
#2 /Users/kensoh/Cloud Drive/Marketing/Website/api/tagui/src/tagui_chrome.php(48): WebSocket\Base->receive()
#3 {main}
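In the log above, the reply for id 5 still arrives after a read has timed out. A minimal Python sketch of a receive loop that tolerates a few timeouts and matches replies by id. This is not the TagUI PHP implementation; `recv`, `max_timeouts` and `timeout_errors` are hypothetical names chosen for illustration. With the websocket-client library used earlier in this thread, `recv` would be `conn.recv` and `timeout_errors` would be `(websocket.WebSocketTimeoutException,)`.

```python
import json

def wait_for_reply(recv, request_id, max_timeouts=3, timeout_errors=(TimeoutError,)):
    """Read messages until the reply with request_id arrives.

    recv is any zero-argument callable that returns one JSON message as text.
    """
    timeouts = 0
    while True:
        try:
            message = json.loads(recv())
        except timeout_errors:
            timeouts += 1
            if timeouts > max_timeouts:
                raise
            continue  # the reply may still arrive on a later read
        if message.get('id') == request_id:
            return message
        # Anything else (protocol events, stale replies) is skipped.

def make_fake_recv(items):
    """Build a stand-in for a connection's recv, replaying canned items."""
    queue = list(items)
    def recv():
        item = queue.pop(0)
        if isinstance(item, Exception):
            raise item
        return item
    return recv

fake = make_fake_recv([
    TimeoutError('read timed out'),
    '{"method":"Page.loadEventFired","params":{"timestamp":1.0}}',
    '{"id":5,"result":{"result":{"type":"string","value":"Google News"}}}',
])
reply = wait_for_reply(fake, 5)
print(reply['result']['result']['value'])
```

Capping the number of tolerated timeouts keeps a genuinely dead connection from looping forever.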
kensoh commented 3 years ago

Adding a wait to throttle the requests still shows the same timeout messages, but a response does arrive after some time -

https://news.google.com/news/?ned=us&hl=en
wait 10 seconds
chrome_step('Runtime.evaluate',{expression: 'document.title'});
wait 10 seconds
chrome_step('Runtime.evaluate',{expression: 'document.title'});
wait 10 seconds
chrome_step('Runtime.evaluate',{expression: 'document.title'});
wait 10 seconds
chrome_step('Runtime.evaluate',{expression: 'document.title'});
kensoh commented 3 years ago

Some clues found. The error happens in headless mode when the user profile directory is given as a relative path -

--user-data-dir=chrome/tagui_user_profile

However, the above setting in tagui/src/tagui works in normal visible mode.

And changing the relative path to an absolute path makes it work in headless mode.

I.e., headless mode behaves differently in newer versions of Chrome.
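A minimal Python sketch of the workaround described above: resolve the profile directory to an absolute path before passing it to Chrome. This is not the actual change made to tagui/src/tagui; the function name and the `base_dir` parameter are assumptions for illustration.

```python
import os

def chrome_headless_args(user_data_dir, debugging_port=9222, base_dir=None):
    """Build headless Chrome arguments with an absolute --user-data-dir."""
    base_dir = base_dir or os.getcwd()
    # os.path.join keeps user_data_dir unchanged if it is already absolute.
    absolute_dir = os.path.abspath(os.path.join(base_dir, user_data_dir))
    return ['--headless',
            '--remote-debugging-port={}'.format(debugging_port),
            '--user-data-dir={}'.format(absolute_dir)]

# The relative path from this thread, resolved against the current directory.
print(chrome_headless_args('chrome/tagui_user_profile'))
```

Resolving against an explicit base directory avoids depending on whichever working directory Chrome happens to be launched from.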

More references on using full path name -

kensoh commented 3 years ago

Adding below what I shared with the maintainer of Chrome Remote Interface (another project using the DevTools Protocol) -

It seems to be a situation unique to my implementation for TagUI. In headless mode, providing --user-data-dir= with a relative path no longer works, although it had worked for the past 2 years. When I change the relative path to a full path, it works in headless mode. In visible mode, it works whether a relative or an absolute path is provided.

Something has probably changed in how headless Chrome behaves when the path provided is relative. I'll close this issue because I don't think it happens outside of the TagUI implementation. I tried to replicate the issue using Python websocket but could not. So the fix has to be an updated implementation of headless Chrome for TagUI.

kensoh commented 3 years ago

The above commit fixes headless Chrome on macOS and Linux. Still to do: check the status on Windows and see whether a fix is required.

kensoh commented 3 years ago

The above commits fix headless Chrome on Windows, so headless mode now works on all OSes.

Users can download the latest copy of TagUI from the link below and unzip it over their existing installation (drag the folders under tagui\src to overwrite the existing ones) - https://github.com/kelaberetiv/TagUI/archive/master.zip

In the next release, this fix will be part of the packaged zip files.

kensoh commented 3 years ago

Closing issue since the latest packaged release TagUI v6.14 is out.

Release notes - https://github.com/kelaberetiv/TagUI/releases/tag/v6.14.0
To download v6.14 - https://tagui.readthedocs.io/en/latest/setup.html
Documentation - https://tagui.readthedocs.io/en/latest/index.html