ClericPy / ichrome

Chrome controller for Humans, based on Chrome Devtools Protocol(CDP) and python3.7+.
https://pypi.org/project/ichrome/
MIT License
227 stars 29 forks

Multiple scrapy spiders that need a shared browser #129

Closed juanfrilla closed 1 year ago

juanfrilla commented 1 year ago

I'm working with multiple Scrapy spiders, and some of them need a headless browser. I'm currently using Selenium, but running a chromedriver and a browser per spider consumes a lot of resources.

I'm thinking of using ichrome to solve this: a single shared browser for every Scrapy spider that needs one. Can each spider that needs a headless browser open a tab in this shared ichrome browser? Does the shared browser need to stay running all the time?

Thanks

ClericPy commented 1 year ago

Try python -m ichrome.web. ichrome.web is EXPERIMENTAL, but it may give you a hand.

Read the quick start with python -m ichrome.web --help:

usage:
>>> python -m ichrome.web

view urls with your browser

http://127.0.0.1:8080/chrome/screenshot?url=http://bing.com

http://127.0.0.1:8080/chrome/download?url=http://bing.com

http://127.0.0.1:8080/chrome/preview?url=http://bing.com

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        load config dict from JSON file of given path to overwrite other args, default Config JSON: {"IncludeRouterArgs": {"prefix": "/chrome"}, "UvicornArgs": {"host": "127.0.0.1", "port": 8080}, "ChromeAPIRouterArgs": {"start_port":
                        9345, "workers_amount": 1, "max_concurrent_tabs": 5, "headless": true, "extra_config": ["--window-size=800,600"]}, "ChromeWorkerArgs": {"RESTART_EVERY": 480, "DEFAULT_CACHE_SIZE": 104857600}}
  -H HOST, --host HOST  uvicorn host, default to 127.0.0.1
  -p PORT, --port PORT  uvicorn port, default to 8080
  --prefix PREFIX       Fastapi.include_router.prefix
  -sp START_PORT, --start-port START_PORT
                        ChromeAPIRouterArgs.start_port
  -w WORKERS_AMOUNT, --workers WORKERS_AMOUNT, --workers-amount WORKERS_AMOUNT
                        ChromeAPIRouterArgs.workers_amount
  --max-concurrent-tabs MAX_CONCURRENT_TABS
                        ChromeAPIRouterArgs.max_concurrent_tabs
  --restart-every RESTART_EVERY
                        ChromeWorker.RESTART_EVERY
  --default-cache-size DEFAULT_CACHE_SIZE
                        ChromeWorker.DEFAULT_CACHE_SIZE
  -cp CHROME_PATH, --chrome-path CHROME_PATH, --chrome_path CHROME_PATH
                        chrome executable file path, default to null(automatic searching)
  --disable-headless    disable --headless arg for chrome
  -U USER_DATA_DIR, --user-data-dir USER_DATA_DIR, --user_data_dir USER_DATA_DIR
                        user_data_dir to save user data, default to ~/ichrome_user_data
  --disable-image, --disable_image
                        disable image for loading performance, default to False
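
The help text says --config loads a JSON file that overwrites the other arguments. A minimal sketch of generating such a file follows. It starts from the default config printed above; the file name ichrome_web_config.json and the raised max_concurrent_tabs value are illustrative assumptions, not recommendations from the author.

```python
import json
from pathlib import Path

# Default config, copied from the --help output above.
config = {
    "IncludeRouterArgs": {"prefix": "/chrome"},
    "UvicornArgs": {"host": "127.0.0.1", "port": 8080},
    "ChromeAPIRouterArgs": {
        "start_port": 9345,
        "workers_amount": 1,
        "max_concurrent_tabs": 5,
        "headless": True,
        "extra_config": ["--window-size=800,600"],
    },
    "ChromeWorkerArgs": {"RESTART_EVERY": 480, "DEFAULT_CACHE_SIZE": 104857600},
}

# Assumed override for illustration: allow more tabs when several spiders share one browser.
config["ChromeAPIRouterArgs"]["max_concurrent_tabs"] = 10

# Hypothetical file name; any path works as long as it is passed to --config.
config_path = Path("ichrome_web_config.json")
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Print the command that would start the service with this config.
print(f"python -m ichrome.web --config {config_path}")
```

Keeping every default key in the file avoids relying on how missing keys are handled, which the help output does not specify.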

juanfrilla commented 1 year ago

Thanks, I'll give it a try.

juanfrilla commented 1 year ago

@ClericPy OK, so I open a browser. How can I connect to that existing browser from a spider and open a tab? Can a tab opened in the shared browser use a proxy?

ClericPy commented 1 year ago

from inspect import getsource

import requests
from torequests import tPool

# ichrome.web listens on port 8080 by default; change the port if you started it with -p.
API = 'http://127.0.0.1:8080/chrome/do'
req = tPool()

# The source code of this callback is sent to the server, which runs it with a tab.
async def tab_callback(self, tab, data, timeout):
    await tab.set_url(data['url'], timeout=timeout)
    return (await tab.querySelector('h1')).text

r = req.post(API,
             json={
                 'data': {
                     'url': 'http://httpbin.org/html'
                 },
                 'tab_callback': getsource(tab_callback),
                 'timeout': 10
             })
print(r.text)
# "Herman Melville - Moby-Dick"

# incognito_args demo: load the URL through a proxy

async def tab_callback(task, tab, data, timeout):
    await tab.wait_loading(3)
    return await tab.html

print(
    requests.post(API,
                  json={
                      'tab_callback': getsource(tab_callback),
                      'timeout': 10,
                      'incognito_args': {
                          'url': 'http://httpbin.org/ip',
                          'proxyServer': 'http://127.0.0.1:1080'
                      }
                  }).text)

Read the source code for more info: https://github.com/ClericPy/ichrome/blob/master/ichrome/routers/fastapi_routes.py
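
For a spider that should not depend on torequests or requests, the same JSON body can be assembled with the standard library alone. The sketch below only builds the request and does not send it. build_do_request is a hypothetical helper name, the endpoint assumes the default host, port, and prefix from the help output, and the payload keys (data, tab_callback, timeout, incognito_args) are the ones used in the examples in this thread.

```python
import json
import urllib.request

# Assumed endpoint: default host, port, and prefix from the --help output.
DO_ENDPOINT = "http://127.0.0.1:8080/chrome/do"

# Callback sources kept as plain strings, mirroring the two callbacks in this thread.
NAVIGATE_AND_DUMP = """\
async def tab_callback(self, tab, data, timeout):
    await tab.set_url(data['url'], timeout=timeout)
    return await tab.html
"""
WAIT_AND_DUMP = """\
async def tab_callback(task, tab, data, timeout):
    await tab.wait_loading(3)
    return await tab.html
"""


def build_do_request(page_url, timeout=10, proxy=None):
    """Return an unsent POST request for /chrome/do (hypothetical helper, not part of ichrome)."""
    if proxy is None:
        # Same shape as the first example: the callback navigates using data['url'].
        payload = {
            "data": {"url": page_url},
            "tab_callback": NAVIGATE_AND_DUMP,
            "timeout": timeout,
        }
    else:
        # Same shape as the incognito_args demo: URL and proxy go into incognito_args.
        payload = {
            "tab_callback": WAIT_AND_DUMP,
            "timeout": timeout,
            "incognito_args": {"url": page_url, "proxyServer": proxy},
        }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        DO_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_do_request("http://httpbin.org/ip", proxy="http://127.0.0.1:1080")
print(request.full_url)
print(json.loads(request.data)["incognito_args"])

# Sending it requires a running `python -m ichrome.web`:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Inside Scrapy, the same payload dict could probably be sent with scrapy.http.JsonRequest(url, data=payload) so the call goes through the spider's normal scheduling; that wiring is a suggestion, not something confirmed in this thread.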

juanfrilla commented 1 year ago

Amazing