binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

can't start pyspider using docker with mongo, scheduler gets unknown project #914

Open Neutrino3316 opened 5 years ago

Neutrino3316 commented 5 years ago
# mongo
docker run --name mongo_pyspider -d -p 27017:27017 mongo:latest
# rabbitmq
docker run --name rabbitmq_pyspider -d rabbitmq:latest

# phantomjs
docker run --name pyspider_phantomjs -d binux/pyspider:latest phantomjs

# result worker
docker run --name pyspider_result_worker -d --link mongo_pyspider:mongo --link rabbitmq_pyspider:rabbitmq binux/pyspider:latest result_worker
# processor, run multiple instances if needed.
docker run --name pyspider_processor -d --link mongo_pyspider:mongo --link rabbitmq_pyspider:rabbitmq binux/pyspider:latest processor
# fetcher, run multiple instances if needed.
docker run --name pyspider_fetcher -d --link pyspider_phantomjs:phantomjs --link rabbitmq_pyspider:rabbitmq binux/pyspider:latest fetcher
# scheduler
docker run --name pyspider_scheduler -d --link mongo_pyspider:mongo --link rabbitmq_pyspider:rabbitmq binux/pyspider:latest scheduler
# webui
docker run --name pyspider_webui -d -p 5001:5000 --link mongo_pyspider:mongo --link rabbitmq_pyspider:rabbitmq --link pyspider_scheduler:scheduler --link pyspider_phantomjs:phantomjs binux/pyspider:latest webui
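
To double-check that the links are in place, the link environment variables can be inspected inside the scheduler container (container names as above; the grep patterns are just for convenience):

docker exec pyspider_scheduler env | grep -i mongo
docker exec pyspider_scheduler env | grep -i rabbitmq

If the --link aliases are picked up, variables such as MONGO_PORT_27017_TCP_ADDR should appear in the output.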

What I did before the bug started

I used the docker run commands above to start the containers. All containers are running normally and everything seems to be OK.

The webui is running. I created a new project named "test" and pasted in the following project code:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

Then I went back to the pyspider dashboard, set the project status to DEBUG or RUNNING, and ran the project.

Expected behavior

The project should be running and fetching data.

Actual behavior

The RUN button in the dashboard turns red when I click it.

The project isn't running. All pyspider modules seem to be fine except the scheduler module.

The scheduler says that the project is unknown; its log is as follows:

[I 190825 05:53:47 scheduler:647] scheduler starting...
[I 190825 05:53:47 scheduler:782] scheduler.xmlrpc listening on 0.0.0.0:23333
[I 190825 05:53:48 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[E 190825 05:54:11 scheduler:306] unknown project: test
[I 190825 05:54:48 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 190825 05:55:48 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 190825 05:56:48 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 190825 05:57:48 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 190825 05:58:48 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
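
To see whether the webui actually saved the project into mongo, the database can also be queried directly from the mongo container (the database and collection names below are my guesses based on pyspider's defaults):

docker exec mongo_pyspider mongo --quiet --eval "db.getMongo().getDBNames()"
docker exec mongo_pyspider mongo projectdb --quiet --eval "db.projectdb.find({name: 'test'}).pretty()"

If the "test" project shows up here but the scheduler still reports it as unknown, the scheduler is probably not reading the same projectdb as the webui.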

Some clues

This bug doesn't happen if I use mysql instead of mongo. Am I connecting to the mongo database correctly? The key shell commands are the following:

# mongo
docker run --name mongo_pyspider -d -p 27017:27017 mongo:latest
# scheduler
docker run --name pyspider_scheduler -d --link mongo_pyspider:mongo --link rabbitmq_pyspider:rabbitmq binux/pyspider:latest scheduler
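
For comparison, the connections could also be passed explicitly instead of relying on the link aliases, something like the sketch below (the database names, the amqp credentials, and the URL schemes are my assumptions based on the deployment docs):

docker run --name pyspider_scheduler -d \
    --link mongo_pyspider:mongo --link rabbitmq_pyspider:rabbitmq \
    binux/pyspider:latest \
    --taskdb "mongodb+taskdb://mongo:27017/taskdb" \
    --projectdb "mongodb+projectdb://mongo:27017/projectdb" \
    --resultdb "mongodb+resultdb://mongo:27017/resultdb" \
    --message-queue "amqp://guest:guest@rabbitmq:5672/%2F" \
    scheduler

The webui and the other components would need the same --taskdb / --projectdb / --resultdb / --message-queue options so that every container talks to the same databases and queue.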
herbertdai commented 2 years ago

I have the same problem when I use mysql as well.