binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

How to use squid with docker for pyspider ? #833

Open djytwy opened 6 years ago

djytwy commented 6 years ago

I am running pyspider with Docker and want to route its requests through a proxy. How can I use a squid Docker container to provide that proxy for pyspider?

This is my docker-compose.yml:

phantomjs:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: phantomjs
    cpu_shares: 512
    environment:
        - 'EXCLUDE_PORTS=5000,23333,24444,6666,22222'
    expose:
        - '25555'
    mem_limit: 512m
    restart: always
phantomjs-lb:
    image: 'dockercloud/haproxy:latest'
    links:
      - phantomjs
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

chromium:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: chromium
    environment:
      - 'EXCLUDE_PORTS=5000,23333,24444,6666,25555'
    expose:
      - '22222'
    restart: always
chromium-lb:
    image: 'dockercloud/haproxy:latest'
    links:
      - chromium
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

squid:
    image: 'docker.io/sameersbn/squid:latest'
    environment:
      - 'EXCLUDE_PORTS=5000,23333,24444,22222'
    expose:
      - '6666'
    volumes:
      - /etc/squid/squid.conf:/etc/squid/squid.conf 
      - /etc/squid/peers.conf:/etc/squid/peers.conf 
    restart: always

fetcher:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: '--message-queue "amqp://guest:guest@172.17.0.3:5672/" --phantomjs-proxy "phantomjs:80" --chromium-proxy "chromium:80" fetcher --xmlrpc'
    cpu_shares: 512
    environment:
      - 'EXCLUDE_PORTS=5000,25555,23333,22222,6666'
    links:
      - 'phantomjs-lb:phantomjs'
      - 'chromium-lb:chromium'
      - 'squid'
    mem_limit: 128m
    restart: always
fetcher-lb:
    image: 'dockercloud/haproxy:latest'
    links:
      - fetcher
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

processor:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: '--projectdb "sqlalchemy+postgresql+projectdb://postgres:123456@172.17.0.2:5432/projectdb" --message-queue "amqp://guest:guest@172.17.0.3:5672/" processor'
    cpu_shares: 512
    mem_limit: 256m
    restart: always

result-worker:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: '--taskdb "sqlalchemy+postgresql+taskdb://postgres:123456@172.17.0.2:5432/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://postgres:123456@172.17.0.2:5432/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://postgres:123456@172.17.0.2:5432/resultdb" --message-queue "amqp://guest:guest@172.17.0.3:5672/" result_worker'
    cpu_shares: 512
    mem_limit: 256m
    restart: always

webui:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: '--taskdb "sqlalchemy+postgresql+taskdb://postgres:123456@172.17.0.2:5432/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://postgres:123456@172.17.0.2:5432/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://postgres:123456@172.17.0.2:5432/resultdb" --message-queue "amqp://guest:guest@172.17.0.3:5672/" webui --max-rate 0.2 --max-burst 3 --scheduler-rpc "http://172.17.0.5:23333/" --fetcher-rpc "http://fetcher/"'
    cpu_shares: 256
    environment:
      - 'EXCLUDE_PORTS=24444,25555,23333,22222,6666'
    links:
      - fetcher-lb:fetcher
    ports:
      - 80:5000
    mem_limit: 256m
    restart: always
webui-lb:
    image: 'dockercloud/haproxy:latest'
    links:
      - webui
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

Like pyspider-demo, the scheduler runs on its own, separate from the other components:

docker run --name scheduler -d -p 23333:23333 --restart=always daocloud.io/djytwy/pyspider:latest \
    --taskdb "sqlalchemy+postgresql+taskdb://postgres:123456@172.17.0.2:5432/taskdb" \
    --resultdb "sqlalchemy+postgresql+resultdb://postgres:123456@172.17.0.2:5432/resultdb" \
    --projectdb "sqlalchemy+postgresql+projectdb://postgres:123456@172.17.0.2:5432/projectdb" \
    --message-queue "amqp://guest:guest@172.17.0.3:5672" \
    scheduler --inqueue-limit 5000 --delete-time 43200

172.17.0.1 is the Docker network gateway.

This is my squid.conf:

acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access deny manager

http_access allow all

http_access allow localhost
# http_access deny all
http_port 6666
coredump_dir /var/spool/squid
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0
refresh_pattern (Release|Packages(.gz)*)$      0       20%     2880
refresh_pattern .               0       20%     4320
#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
via off
forwarded_for off

request_header_access From deny all
request_header_access Server deny all
request_header_access WWW-Authenticate deny all
request_header_access Link deny all
request_header_access Cache-Control deny all
request_header_access Proxy-Connection deny all
request_header_access X-Cache deny all
request_header_access X-Cache-Lookup deny all
request_header_access Via deny all
request_header_access X-Forwarded-For deny all
request_header_access Pragma deny all
request_header_access Keep-Alive deny all

cache_mem 128 MB    
maximum_object_size 16 MB   
cache_dir ufs /var/spool/squid 100 16 256
access_log /var/log/squid/access.log    
visible_hostname www.twy.com     
cache_mgr 676534074@qq.com       

include /etc/squid/peers.conf

never_direct allow all

And my peers.conf:

cache_peer 114.231.159.16 parent 40909 0 round-robin proxy-only no-query connect-fail-limit=2
cache_peer 140.224.110.164 parent 43822 0 round-robin proxy-only no-query connect-fail-limit=2
binux commented 6 years ago

You need to link squid to phantomjs and chromium as well; otherwise only normal HTTP requests can use the proxy.
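
A minimal sketch of how that could look in the compose file above (only the two services that change are shown; everything else stays as posted, and pointing crawl_config at the service name is an assumption on top of binux's hint, not something he specified):

phantomjs:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: phantomjs
    links:
      - squid          # phantomjs can now resolve and reach squid:6666
    expose:
      - '25555'
    restart: always

chromium:
    image: 'daocloud.io/djytwy/pyspider:latest'
    command: chromium
    links:
      - squid          # same for chromium, so JS-rendered fetches can use the proxy
    expose:
      - '22222'
    restart: always

With the links in place, crawl_config could also refer to the proxy by service name ('proxy': 'squid:6666') instead of a hard-coded bridge IP.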

djytwy commented 6 years ago

Thank you for your answer! But I still have some questions about squid:

1. Will squid use every proxy listed in peers.conf? If so, it must be very slow: a working proxy may only live for about three minutes, so cycling through many dead proxies wastes a lot of time. But if peers.conf lists only one proxy, requests are fast enough, and then squid itself seems pointless?

2. In pyspider I built two tasks: one fetches fresh proxies and the other uses them, like this (screenshot: ex.png). Whenever I rewrite peers.conf I have to reload squid, but squid is running in Docker. How can I reload it without entering the container by hand? (See the reload sketch at the end of this comment.)

3. How do I get squid to provide the proxy to pyspider when everything runs in Docker? This is squid in my Docker network (screenshot: squid.png), and this is my test code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-21 09:39:07
# Project: test_proxy

from pyspider.libs.base_handler import *
import os

class Handler(BaseHandler):
    crawl_config = {
        # every request goes through the squid container (172.17.0.11:6666)
        'proxy': '172.17.0.11:6666'
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.917.com/',callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

These are my steps: first I get a live proxy, then I write it to peers.conf (at that point peers.conf holds only that one proxy) and reload squid inside the container. Everything should be ready, but the task still fails with a timeout. Thanks again!!! :blush:
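
For question 2 above, a minimal sketch of a reload flow that avoids an interactive shell (assumptions: the service is named squid as in the compose file, the squid binary is on the container's PATH in the sameersbn/squid image, 1.2.3.4:8080 is only a placeholder proxy, and the curl check runs on a Linux Docker host):

# 1. Rewrite the bind-mounted peers.conf on the host; the container sees the
#    change at once because of the volume mapping in docker-compose.yml.
echo 'cache_peer 1.2.3.4 parent 8080 0 round-robin proxy-only no-query connect-fail-limit=2' \
    > /etc/squid/peers.conf

# 2. Ask the running squid to re-read its configuration without entering the
#    container (use the full path, e.g. /usr/sbin/squid, if it is not on PATH).
docker exec $(docker-compose ps -q squid) squid -k reconfigure

# 3. Sanity-check the proxy before pointing pyspider at it; on a Linux host the
#    container's bridge IP is reachable directly.
curl -x http://172.17.0.11:6666 -I https://www.917.com/

If the curl check also times out, the upstream proxy in peers.conf (remember that never_direct allow all forces everything through the peers) is a more likely cause than the pyspider side.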