HENNGE / arsenic

Async WebDriver implementation for asyncio and asyncio-compatible frameworks

Chrome Headless HTML broken with Proxy #39

Open Gelbpunkt opened 5 years ago

Gelbpunkt commented 5 years ago
import asyncio
from os import devnull
from arsenic import get_session
from arsenic.browsers import Chrome
from arsenic.services import Chromedriver
from async_timeout import timeout

service = Chromedriver(log_file=devnull)
# Flags as reported; only the duplicate '--disable-gpu' entry is dropped.
browser = Chrome(chromeOptions={'args': [
    '--headless', '--disable-gpu', '--hide-scrollbars', '--window-size=1920,1080',
    '--remote-debugging-port=9222',
    '--proxy-server="socks5://127.0.0.1:9050"',
    '--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost"',
]})

async def main():
    try:
        async with timeout(10):
            async with get_session(service, browser) as session:
                await session.get("https://example.com")
                print(await session.get_page_source())
    except asyncio.TimeoutError:
        print("Took too long to take screenshot.")

asyncio.get_event_loop().run_until_complete(main())

This code always returns <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html> and screenshots are completely white.

However, it works without the proxy option. When we tried switching to Firefox we got this:

import asyncio
from os import devnull
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
from async_timeout import timeout

service = Geckodriver(log_file=devnull)
browser = Firefox(firefoxOptions={'args': ['-headless']})

async def main():
    try:
        async with timeout(10):
            async with get_session(service, browser) as session:
                await session.get("http://idlerpg.fun")
                image = await session.get_screenshot()
    except asyncio.TimeoutError:
        print("Took too long to take screenshot.")
        return
    image.seek(0)  # rewind the screenshot buffer before reading it

asyncio.get_event_loop().run_until_complete(main())

Log output and traceback:
2018-08-14 14:12.33 request                        body={"desiredCapabilities": {"browserName": "firefox", "marionette": true, "acceptInsecureCerts": true, "firefoxOptions": {"args": ["-headless"]}}} method=POST url=http://localhost:55423/session
2018-08-14 14:12.33 response                       body={"desiredCapabilities": {"browserName": "firefox", "marionette": true, "acceptInsecureCerts": true, "firefoxOptions": {"args": ["-headless"]}}} data={'value': {'error': 'unknown error', 'message': 'Process unexpectedly closed with status 1', 'stacktrace': ''}} method=POST response=<ClientResponse(http://localhost:55423/session) [500 Internal Server Error]>
<CIMultiDictProxy('Content-Type': 'application/json; charset=utf-8', 'Cache-Control': 'no-cache', 'Content-Length': '105', 'Date': 'Tue, 14 Aug 2018 14:12:33 GMT')>
 url=http://localhost:55423/session
Traceback (most recent call last):
  File "/home/travitia/production/Travitia/cogs/owner.py", line 102, in _eval
    ret = await func()
  File "<string>", line 12, in func
  File "/usr/local/lib/python3.6/dist-packages/arsenic/__init__.py", line 16, in __aenter__
    self.session = await start_session(self.service, self.browser, self.bind)
  File "/usr/local/lib/python3.6/dist-packages/arsenic/__init__.py", line 29, in start_session
    return await driver.new_session(browser, bind=bind)
  File "/usr/local/lib/python3.6/dist-packages/arsenic/webdriver.py", line 57, in new_session
    raise SessionStartError(err_resp['error'], err_resp.get('message', ''), original_response)
arsenic.errors.SessionStartError: unknown error: Process unexpectedly closed with status 1

Any idea how to fix this? Chrome is v 68 (latest)
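
A possible explanation for the Firefox failure, offered as an assumption rather than a confirmed diagnosis: geckodriver reads Firefox-specific settings from the vendor-prefixed capability key moz:firefoxOptions, so the unprefixed firefoxOptions visible in the request log may have been ignored, leaving Firefox to start without -headless. The sketch below only builds the overrides dictionary; the helper name headless_firefox_overrides is hypothetical, and passing the result to arsenic is shown as a comment because it has not been tested against this setup.

```python
import json

# Capability key that geckodriver reads; the report used "firefoxOptions".
FIREFOX_OPTIONS_KEY = "moz:firefoxOptions"


def headless_firefox_overrides() -> dict:
    """Build capability overrides asking Firefox to start headless."""
    return {FIREFOX_OPTIONS_KEY: {"args": ["-headless"]}}


overrides = headless_firefox_overrides()
print(json.dumps(overrides))

# Assumed usage with arsenic installed (the key is not a valid Python
# identifier, so it has to be passed via dict unpacking):
#     browser = Firefox(**overrides)
```

If the session still fails to start with the prefixed key, the cause lies elsewhere and the geckodriver log (rather than devnull) would be the next thing to inspect.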

Diniboy1123 commented 5 years ago

I can reproduce it too, the images are all blank and the source is the same...

However, running google-chrome --proxy-server="socks5://127.0.0.1:9050" --disable-gpu --headless --screenshot https://google.com directly produces a correct image. Is it a library or a Chromedriver issue?
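
One difference between the two invocations that may be worth ruling out: on the command line, the shell removes the double quotes around the proxy URL before Chrome sees it, whereas strings in the 'args' list are not parsed by a shell, so the quotes would likely reach Chrome as literal characters. The sketch below uses only the standard library's shlex module to show what a POSIX shell would pass for each reported flag. strip_shell_quoting is a hypothetical helper, and nothing in this thread confirms that the quotes are the cause of the blank page.

```python
import shlex

# Flags exactly as written in the report's chromeOptions 'args' list.
REPORTED_ARGS = [
    '--proxy-server="socks5://127.0.0.1:9050"',
    '--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost"',
]


def strip_shell_quoting(arg: str) -> str:
    """Return the single argument a POSIX shell would pass for `arg`."""
    parts = shlex.split(arg)
    if len(parts) != 1:
        raise ValueError(f"expected exactly one argument, got {parts!r}")
    return parts[0]


# What Chrome would receive if the same text were typed into a shell.
cleaned_args = [strip_shell_quoting(a) for a in REPORTED_ARGS]
for before, after in zip(REPORTED_ARGS, cleaned_args):
    print(f"{before!r} -> {after!r}")
```

Passing the cleaned values in 'args' instead of the quoted ones would be a cheap experiment to separate a quoting problem from a genuine headless/proxy incompatibility.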

ojii commented 5 years ago

To help me reproduce this, could you please tell me what proxy software you're using?

Gelbpunkt commented 5 years ago

We're using Tor / Torsocks

ojii commented 5 years ago

Also, does it work in non-headless mode behind a proxy?

Gelbpunkt commented 5 years ago

Yep, removing either proxy or headless works fine.

Gelbpunkt commented 5 years ago

As @Diniboy1123 said, using the chrome CLI works fine and produces the result that is expected.

Using google.com: https://cdn.discordapp.com/attachments/466937541552111616/479158940215672833/screenshot.png