binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0
16.51k stars · 3.69k forks

How to config the redis to the message-queue? #730

Open dongrixinyu opened 7 years ago

dongrixinyu commented 7 years ago

  • pyspider version: the latest, pulled from the Docker image on 2017/8/24
  • Operating system: CentOS 7.2
  • Start-up command: docker-compose, with the crawler distributed across Docker containers

I want to connect Redis to pyspider as the store for messages sent between the fetcher and the processor, but it failed, because I don't know how to configure Redis.

Here is my docker-compose.yml:

```yaml
mysql:
  image: 'mysql:latest'
  container_name: mysql
  environment:
    - LANG=C.UTF-8
    - MYSQL_ROOT_PASSWORD=123456
  ports:
    - "192.168.0.137:3306:3306"

redis:
  image: 'redis:latest'
  container_name: redis
  restart: always
  ports:
    - "192.168.0.137:6379:6379"

scheduler:
  image: 'cecil/pyspider:add_mysql'
  container_name: scheduler
  links:
    - mysql:mysql
    - redis:redis
  command: '--taskdb "mysql+taskdb://root:123456@192.168.0.137:3306/taskdb" --projectdb "mysql+projectdb://root:123456@192.168.0.137:3306/projectdb" --resultdb "mysql+resultdb://root:123456@192.168.0.137:3306/resultdb" --message-queue "redis://192.168.0.137:6379/3" scheduler --inqueue-limit 5000 --delete-time 43200'
  ports:
    - "192.168.0.137:23333:23333"
  restart: always

fetcher:
  image: cecil/pyspider:add_mysql
  command: '--message-queue "redis://192.168.0.137:6379/1" --phantomjs-proxy "phantomjs:80" fetcher --xmlrpc'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,25555,23333'
  links:
    - 'phantomjs-lb:phantomjs'
  mem_limit: 128m
  restart: always

fetcher-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - fetcher
  restart: always

phantomjs:
  image: 'cecil/pyspider:add_mysql'
  command: phantomjs
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,23333,24444'
  expose:
    - '25555'
  mem_limit: 512m
  restart: always

phantomjs-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - phantomjs
  restart: always

processor:
  image: 'cecil/pyspider:add_mysql'
  command: '--projectdb "mysql+projectdb://root:123456@192.168.0.137:3306/projectdb" --message-queue "redis://192.168.0.137:6379/2" processor'
  cpu_shares: 512
  mem_limit: 256m
  restart: always

processor-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - processor
  restart: always

result-worker:
  image: 'cecil/pyspider:add_mysql'
  container_name: result-worker
  command: '--taskdb "mysql+taskdb://root:123456@192.168.0.137:3306/taskdb" --projectdb "mysql+projectdb://root:123456@192.168.0.137:3306/projectdb" --resultdb "mysql+resultdb://root:123456@192.168.0.137:3306/resultdb" --message-queue "redis://192.168.0.137:6379/4" result_worker'
  cpu_shares: 512
  mem_limit: 256m
  restart: always

webui:
  image: 'cecil/pyspider:add_mysql'
  container_name: webui
  command: '--taskdb "mysql+taskdb://root:123456@192.168.0.137:3306/taskdb" --projectdb "mysql+projectdb://root:123456@192.168.0.137:3306/projectdb" --resultdb "mysql+resultdb://root:123456@192.168.0.137:3306/resultdb" --message-queue "redis://192.168.0.137:6379/5" webui --username "cecil" --password "123456" --need-auth --max-rate 0.3 --max-burst 2 --scheduler-rpc "http://192.168.0.137:23333/" --fetcher-rpc "http://fetcher/"'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=24444,25555,23333'
  links:
    - 'fetcher-lb:fetcher'
  mem_limit: 256m
  restart: always
  ports:
    - "192.168.0.137:5000:5000"
```

Actually, no messages are sent to the Redis database when I check with a Redis management client, and the fetcher, processor, and scheduler counts on the dashboard are each 0.

Please focus on `--message-queue "redis://192.168.0.137:6379/5"`, `--message-queue "redis://192.168.0.137:6379/4" result_worker`, and so on. I don't know whether the database index at the end of the message-queue configuration may differ between Docker containers. That is what confuses me!
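(For reference: the trailing number in a `redis://` URL is the Redis logical database index. A small standard-library sketch of how that index can be read out of the URLs above:)

```python
from urllib.parse import urlparse

def redis_db_index(url: str) -> int:
    """Return the Redis database index encoded in a redis:// URL path (default 0)."""
    path = urlparse(url).path.lstrip("/")
    return int(path) if path else 0

# Each component in the compose file above points at a *different* index:
for url in ("redis://192.168.0.137:6379/4", "redis://192.168.0.137:6379/5"):
    print(url, "-> db", redis_db_index(url))
```

Components that select different indexes are writing to and reading from separate keyspaces, so they never see each other's messages.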

Thanks very much for any help or advice!

binux commented 7 years ago

First, you seem to have exposed the Redis port on the host, but the host IP may not be accessible from inside a Docker container unless you are running the container on the host network. The correct way to link Redis is to add `links` entries to all the other pyspider components and use the environment variables Docker sets in the container to get the linked Redis hostname (please refer to Docker's documentation for this part).

Second, you must configure all pyspider components to use the same Redis database index so that they can communicate.
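A minimal sketch of what that could look like with Compose v1-style `links` (the shared index `/0` and the trimmed commands are illustrative choices, not requirements):

```yaml
redis:
  image: 'redis:latest'
  restart: always
  # no host port binding needed; linked containers reach it by hostname

scheduler:
  image: 'cecil/pyspider:add_mysql'
  links:
    - redis:redis
  # every component uses the linked hostname "redis" and the SAME db index
  command: '--message-queue "redis://redis:6379/0" scheduler'

fetcher:
  image: 'cecil/pyspider:add_mysql'
  links:
    - redis:redis
  command: '--message-queue "redis://redis:6379/0" fetcher --xmlrpc'
```

With the link in place, the hostname `redis` resolves inside each container, so no host IP or published Redis port is needed.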


dongrixinyu commented 7 years ago

Thanks very much for your help. I'm new to web development and still a little confused about the environment variables in the image.

I have another question about using pyspider with Docker: do I need to override the "on_result" method to get access to the MySQL database? When running a project, I can see all the data crawled from the web in Navicat without any change to the default script, so I assume overriding the method is not necessary unless I change the database structure.

By the way, there is another point I'd like to share with you. I pulled the latest versions of binux/pyspider, mysql, and redis from the Docker image registry a week ago, and I had to add the MySQL-python package to the binux/pyspider container and build a new image; that avoided the errors when linking MySQL to pyspider. I don't know why, but it solved the problem.
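(The image change described above might look roughly like this Dockerfile; the base tag is an assumption:)

```dockerfile
# Hypothetical sketch of extending the pyspider image with a MySQL driver.
FROM binux/pyspider:latest
RUN pip install MySQL-python
```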

Thanks again for your instructions!

binux commented 7 years ago

Whether to override on_result depends on your needs: if you don't need to change the default schema or do any post-processing, overriding it is unnecessary.
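For illustration, here is a runnable stand-in for that hook pattern; in a real pyspider project script the class would subclass `BaseHandler` from `pyspider.libs.base_handler`, and the stand-in list below would be a database write:

```python
# Minimal stand-in for pyspider's on_result hook. A plain class is used so
# the sketch runs anywhere; in a real project script it would be
# `class Handler(BaseHandler)`.
class Handler:
    def __init__(self):
        self.rows = []  # stand-in for a custom result table

    def on_result(self, result):
        # pyspider calls this once per task with the dict the callback returned.
        # Override it only to change the schema or post-process results;
        # otherwise the default resultdb writer already stores them.
        if not result:
            return
        self.rows.append({"url": result.get("url"), "title": result.get("title")})

h = Handler()
h.on_result({"url": "http://example.com/", "title": "Example Domain"})
h.on_result(None)  # empty results are ignored
print(len(h.rows))  # -> 1
```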

pyspider uses mysql-connector in its Docker images.
