binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

How to debug / work with the ResultWorker #788

Open Amain opened 6 years ago

Amain commented 6 years ago

Thank you for creating pyspider. This is more of a documentation request, I suppose. I figured out how to crawl a site, following multiple pages and the links on those pages using multiple self.crawl statements. I also figured out how to create my own result worker, which I intend to use to send results to a database.

How can I test / debug this result worker?

When in "edit" mode, pressing run and following to one final result does not trigger the on_result call in my own worker.

From the "project overview" mode, clicking run does nothing either.

I'm sure the worker is running, for debugging I added:

    def __init__(self, resultdb, inqueue):
        super(MyResultWorker, self).__init__(resultdb, inqueue)
        logging.info("MyResultWorker started")

    $ cat config.json
    {
      "result_worker": {
        "result_cls": "MyResultWorker.MyResultWorker"
      }
    }
    $ pyspider -c config.json --debug all
    phantomjs fetcher running on port 25555
    [I 180514 22:56:44 MyResultWorker:7] MyResultWorker started
    [I 180514 22:56:44 result_worker:49] result_worker starting...
    [I 180514 22:56:44 tornado_fetcher:638] fetcher starting...

  1. What am I missing here?
  2. When is on_result (of an overridden ResultWorker) called?
  3. How to debug the ResultWorker framework?
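
On question 3, one possible approach is to exercise on_result directly, without the scheduler or result queue involved. The following is a minimal sketch, not pyspider's own test method. It assumes on_result takes (task, result) and that task is a dict with taskid, project and url keys. The ResultWorker base class here is a stand-in so the snippet runs without pyspider installed; in a real setup it would be replaced by an import from pyspider.result. The saved list and the fake task are hypothetical and only serve the illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("MyResultWorker")


class ResultWorker(object):
    """Stand-in for pyspider.result.ResultWorker (assumption: same
    constructor arguments and on_result signature)."""

    def __init__(self, resultdb, inqueue):
        self.resultdb = resultdb
        self.inqueue = inqueue


class MyResultWorker(ResultWorker):
    def __init__(self, resultdb, inqueue):
        super(MyResultWorker, self).__init__(resultdb, inqueue)
        # Hypothetical in-memory sink standing in for a real database.
        self.saved = []

    def on_result(self, task, result):
        # Log every call so it is obvious whether the worker receives anything.
        logger.info("on_result: project=%s url=%s",
                    task.get("project"), task.get("url"))
        if not result:
            return  # skip empty results
        self.saved.append((task["taskid"], result))


# Call on_result by hand with hand-built dicts; no scheduler or queue needed.
worker = MyResultWorker(resultdb=None, inqueue=None)
fake_task = {"taskid": "t1", "project": "demo", "url": "http://example.com/"}
worker.on_result(fake_task, {"title": "Example"})
worker.on_result(fake_task, None)
print(worker.saved)
```

If the worker behaves correctly when called this way but still logs nothing under pyspider, the problem is more likely in how results reach the worker than in the worker code itself.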

Thanks in advance...

Amain commented 6 years ago

The following issues (including their workarounds) solve this usability problem partly:

After intensive reverse engineering I'm starting to figure out how this thing operates. It's easy to see that this framework is well thought through and can make life easier, which is the main purpose of frameworks.

That said, the documentation is lacking at the moment, especially on the flow of things: the relation between projects, crawler methods and tasks, and main configuration options like age=. Though it's all relatively simple, it takes too much time to figure out. You could improve there.

Having said that, thanks again for the time you have put into the framework. I'm using it actively now.