binux / pyspider

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
Apache License 2.0

puppeteer fetcher does not work #929

Open hubitor opened 5 years ago

hubitor commented 5 years ago

I'm trying to use the puppeteer fetcher with this script from the examples:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.twitch.tv/directory/game/Dota%202',
                   fetch_type='chrome', callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "channels": [{
                "title": x('.title').text(),
                "viewers": x('.info').contents()[2],
                "name": x('.info a').text(),
            } for x in response.doc('.stream.item').items()]
        }

The result is this: {'channels': [], 'url': 'https://www.twitch.tv/directory/game/Dota%202'}

The puppeteer fetcher appears to be running, since I see this when I start pyspider: puppeteer fetcher running on port 22222

When I modify the content of js_script and rerun the script, pyspider doesn't do anything. It doesn't even raise an error if I insert faulty code.

I've already found a related issue, but it didn't help: https://github.com/binux/pyspider/issues/902

Expected behavior

Get results.

Actual behavior

No results.

How to reproduce

  1. Use latest development version of pyspider.
  2. Use above script
  3. Start pyspider & run the script.

larrymeng commented 4 years ago

This is wrong: fetch_type='chrome'

Correct: fetch_type='puppeteer'

You can see why in https://github.com/binux/pyspider/blob/master/pyspider/fetcher/tornado_fetcher.py#L141, where the fetcher only checks for 'puppeteer':

elif task.get('fetch', {}).get('fetch_type') in ('puppeteer', ):
    ...
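
To make the effect of that check concrete, here is a minimal standalone sketch of the dispatch. It is not pyspider's actual code: pick_fetcher is a hypothetical helper, and the fallback to a plain HTTP fetch for unrecognized values is an assumption (it would be consistent with the empty "channels" list reported above, since the page would not be rendered with JavaScript).

```python
def pick_fetcher(task):
    """Return the name of the fetcher a task would be routed to (simplified sketch)."""
    fetch_type = task.get('fetch', {}).get('fetch_type')
    # Same membership test as the line quoted from tornado_fetcher.py.
    if fetch_type in ('puppeteer', ):
        return 'puppeteer'
    # Assumption: any value that is not recognized falls back to plain HTTP.
    return 'http'


# 'chrome' is not in the tuple, so the puppeteer branch is never taken.
print(pick_fetcher({'fetch': {'fetch_type': 'chrome'}}))
print(pick_fetcher({'fetch': {'fetch_type': 'puppeteer'}}))
```

Under these assumptions, changing fetch_type='chrome' to fetch_type='puppeteer' in the self.crawl(...) call is what routes the request to the puppeteer fetcher.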