HENNGE / arsenic

Async WebDriver implementation for asyncio and asyncio-compatible frameworks

Getting 'unknown error: net::ERR_CONNECTION_CLOSED' Error #159

Open pythonic-shk opened 1 year ago

pythonic-shk commented 1 year ago

I am trying to scrape my company's website. To speed things up, I am using Arsenic, an asynchronous web scraping / WebDriver library. When I run this code I see multiple drivers being spawned on different local ports.

Starting ChromeDriver 109.0.5414.74 (e7c5703604daa9cc128ccf5a5d3e993513758913-refs/branch-heads/5414@{#1172}) on port 59479
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
[1674821791.415][SEVERE]: bind() failed: Cannot assign requested address (99)
ChromeDriver was started successfully.
Starting ChromeDriver 109.0.5414.74 (e7c5703604daa9cc128ccf5a5d3e993513758913-refs/branch-heads/5414@{#1172}) on port 40633
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
[1674821791.853][SEVERE]: bind() failed: Cannot assign requested address (99)
ChromeDriver was started successfully.

After scraping some URLs it raises an error that I am not able to understand.

2023-01-27 12:16.44 [error ] error data={'error': 'unknown error', 'message': 'unknown error: net::ERR_CONNECTION_CLOSED\n (Session info: headless chrome=109.0.5414.119)', 'stacktrace': '#0 0x55e6edd7e303 \n#1 0x55e6edb52d37 \n#2 0x55e6edb4ad85 \n#3 0x55e6edb3df87 \n#4 0x55e6edb3f4e9 \n#5 0x55e6edb3e2fe \n#6 0x55e6edb3d432 \n#7 0x55e6edb3d285 \n#8 0x55e6edb3bc77 \n#9 0x55e6edb3c2a4 \n#10 0x55e6edb54c48 \n#11 0x55e6edbc7f15 \n#12 0x55e6edbaf982 \n#13 0x55e6edbc788c \n#14 0x55e6edbaf753 \n#15 0x55e6edb82a14 \n#16 0x55e6edb83b7e \n#17 0x55e6eddcd32e \n#18 0x55e6eddd0c0e \n#19 0x55e6eddb3610 \n#20 0x55e6eddd1c23 \n#21 0x55e6edda5545 \n#22 0x55e6eddf26a8 \n#23 0x55e6eddf2836 \n#24 0x55e6ede0dd13 \n#25 0x7fae53b0fea5 start_thread\n'} message=unknown error: net::ERR_CONNECTION_CLOSED (Session info: headless chrome=109.0.5414.119)
stacktrace=
#0 0x55e6edd7e303
#1 0x55e6edb52d37
#2 0x55e6edb4ad85
#3 0x55e6edb3df87
#4 0x55e6edb3f4e9
#5 0x55e6edb3e2fe
#6 0x55e6edb3d432
#7 0x55e6edb3d285
#8 0x55e6edb3bc77
#9 0x55e6edb3c2a4
#10 0x55e6edb54c48
#11 0x55e6edbc7f15
#12 0x55e6edbaf982
#13 0x55e6edbc788c
#14 0x55e6edbaf753
#15 0x55e6edb82a14
#16 0x55e6edb83b7e
#17 0x55e6eddcd32e
#18 0x55e6eddd0c0e
#19 0x55e6eddb3610
#20 0x55e6eddd1c23
#21 0x55e6edda5545
#22 0x55e6eddf26a8
#23 0x55e6eddf2836
#24 0x55e6ede0dd13
#25 0x7fae53b0fea5 start_thread
status=500 type=<class 'arsenic.errors.UnknownError'> failed getting session

I am running this in Docker using a RHEL 7 Linux image, with Python 3.8, Arsenic 21.8, Chrome v109, and ChromeDriver v109.

Code:

import asyncio
import os

import arsenic.errors
from arsenic import get_session, stop_session, browsers, services

def initialize_webdriver():
    service = services.Chromedriver(binary=os.environ.get('CHROMEDRIVER_PATH'))
    browser = browsers.Chrome()
    browser.capabilities = {
        "goog:chromeOptions": {
            "args": ["--no-sandbox", "--headless", "--verbose",
                     "--disable-gpu", "--disable-web-security", "--allow-insecure-localhost",
                     "--disable-dev-shm-usage", "--enable-javascript"]
        }
    }
    return service, browser

async def scraper(limit, service, browser, url):
    async with limit:
        try:
            async with get_session(service, browser) as session:
                # print("inside scraper")
                await session.get(url)
                try:
                    # <code to get web elements>
                    return results
                except asyncio.TimeoutError as msg:
                    print("failed scraping url ", url)
                    await stop_session(session)
                    print(msg)
                    return []
        except (arsenic.errors.UnknownArsenicError, arsenic.errors.UnknownError, arsenic.errors.ArsenicError) as msg:
            print("failed getting session")
            global failed_urls
            failed_urls.append(url)
            limit.release()
            return []

async def run(service, browser, urls):
    limit = asyncio.Semaphore(30)
    results = await asyncio.gather(*[scraper(limit, service, browser, url) for url in urls])
    print(results)

if __name__ == "__main__":
    failed_urls = []
    urls = extract_urls()  # collects URLs from the website's sitemap
    service, browser = initialize_webdriver()
    asyncio.run(run(service, browser, urls))

Even after reducing the semaphore to 20, I am getting the same issue. I need to understand why this error occurs and how to resolve it.

dimaqq commented 1 year ago

Doesn't ChromeDriver, by default, only allow connections from localhost? The log shows it being accessed over an IPv4 address instead, and the URL that ChromeDriver prints in the log suggests using --allowed-ips to allowlist your arsenic host.
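For illustration, a minimal sketch (not from this thread) of what that suggestion could look like: start ChromeDriver yourself with --allowed-ips and point arsenic at it through its Remote service instead of letting it spawn a local driver. The addresses, the port, and the exact services.Remote usage are assumptions; adjust them to your own setup.

# Sketch only (assumption, not from the thread). ChromeDriver is started
# separately on the machine/container that runs the browser, e.g.:
#
#   chromedriver --port=9515 --allowed-ips=172.17.0.2
#
# and arsenic connects to that remote driver instead of spawning its own.
import asyncio

from arsenic import get_session, browsers, services

async def main():
    # Hypothetical address of the host running ChromeDriver.
    service = services.Remote("http://172.17.0.3:9515")
    browser = browsers.Chrome()
    browser.capabilities = {
        "goog:chromeOptions": {"args": ["--headless", "--no-sandbox"]}
    }
    async with get_session(service, browser) as session:
        await session.get("https://example.com")

asyncio.run(main())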

pythonic-shk commented 1 year ago

Both ChromeDriver and the Chrome browser are installed locally in the Docker image. I am able to scrape almost 50 percent of the webpages, but somewhere in the middle I get the ERR_CONNECTION_CLOSED error, and because of it I get asyncio.futures.TimeoutError and the container exits. What is your recommendation in this case?

pythonic-shk commented 1 year ago

@dimaqq I am getting various errors when trying to obtain a session:

- ERR_CONNECTION_CLOSED
- ERR_PROXY_CONNECTION_FAILED
- unknown error: cannot kill Chrome
- DevToolsActivePort file doesn't exist (while trying to initiate the Chrome browser)
- unknown error: Chrome crashed

One of the errors above always occurs.
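One mitigation commonly suggested for the DevToolsActivePort and "Chrome crashed" failures when many headless Chrome instances run in the same container, though not something raised in this thread, is to give each session its own --user-data-dir so the instances don't share a profile directory. A minimal sketch, reusing the capabilities layout from the code above; the tempfile-based profile directory is an assumption.

# Sketch: build a Chrome config with a unique profile directory per session.
import tempfile

from arsenic import browsers

def make_browser():
    # Fresh profile directory for this session (a real script should
    # delete it after the session finishes).
    user_data_dir = tempfile.mkdtemp(prefix="chrome-profile-")
    browser = browsers.Chrome()
    browser.capabilities = {
        "goog:chromeOptions": {
            "args": [
                "--headless",
                "--no-sandbox",
                "--disable-dev-shm-usage",
                f"--user-data-dir={user_data_dir}",
            ]
        }
    }
    return browser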

Also, when stop_session is initiated at exit, it is sometimes unable to terminate the driver subprocess and I get a warning.

Although I handle all these errors, the script becomes very slow after a point.
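The thread doesn't show how those errors are handled; one plausible shape for it, sketched here purely as an assumption, is to retry the session a few times per URL so a single ERR_CONNECTION_CLOSED or crashed Chrome doesn't drop the URL entirely. The scrape_with_retry helper, the retry count, the backoff, and returning session.get_page_source() as the payload are all hypothetical choices.

# Sketch of hypothetical per-URL retry logic; the retry count and backoff
# values are arbitrary assumptions, not something from this thread.
import asyncio

import arsenic.errors
from arsenic import get_session

async def scrape_with_retry(service, browser, url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            async with get_session(service, browser) as session:
                await session.get(url)
                return await session.get_page_source()
        except arsenic.errors.ArsenicError as exc:
            print(f"attempt {attempt} failed for {url}: {exc}")
            await asyncio.sleep(attempt)  # simple linear backoff before retrying
    return None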

I am testing it with around 2000 URLs (all belonging to the same domain) inside Docker with a RHEL 7 image, running 10 WebDriver sessions at a time.

So, do I need to be aware of any limitations of this package, or of the minimum resources required to run it without issues?

dimaqq commented 1 year ago

This is an open source project. You’re welcome to make it better.