eventuallyc0nsistent / arachne

A flask API for running your scrapy spiders
http://arachne.readthedocs.org/en/latest/

Dynamic-endpoint support #11

Open Strahivan opened 7 years ago

Strahivan commented 7 years ago

It would be nice if I could use dynamic endpoints (in SPIDER_SETTINGS), like: 'endpoint': 'crawl/'
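For context, a SPIDER_SETTINGS entry today uses a fixed endpoint string that run_spider_endpoint matches against the URL path. A sketch of such an entry (the location and spider values here are hypothetical placeholders):

```python
# A typical static SPIDER_SETTINGS entry; the module path and spider
# name below are hypothetical placeholders, not from the real project.
SPIDER_SETTINGS = [
    {
        'endpoint': 'cg-spider',           # fixed string matched by run_spider_endpoint
        'location': 'spiders.craigslist',  # hypothetical module path
        'spider': 'CraigslistSpider',
    },
]
```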

eventuallyc0nsistent commented 7 years ago

Hi @Strahivan,

This sounds like a great idea. I want to include it in the next version of arachne. For now, here's how you can work around it.

First, update your spider class and add an __init__ method to it like so (I'll explain below):


class CraigslistSpider(scrapy.Spider):
    """
    Tickets spider for NYC
    """
    ...

    def __init__(self, *args, **kwargs):
        super(CraigslistSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://newyork.craigslist.org/search/tia#list/%s' % kwargs.get('your_url_param')]

    def parse(self, response):
        ...

Next, update the run_spider_endpoint function in arachne to look like this:

def run_spider_endpoint(spider_name):
    """Search for the spider_name in the SPIDER_SETTINGS dict and
    start running the spider with the Scrapy API
    .. version 0.4.0:
        endpoint returns the `status` as `running` and a way to go back to `home` endpoint
    """

    for item in app.config['SPIDER_SETTINGS']:
        if spider_name in item['endpoint']:
            spider_loc = '%s.%s' % (item['location'], item['spider'])
            start_crawler(spider_loc, app.config, item.get('scrapy_settings'), **request.args)
            return jsonify(home=request.url_root, status='running', spider_name=spider_name)
    return abort(404)

Here you are simply passing request.args from Flask through to the crawler.
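To see why this works: unpacking a mapping with ** turns each query parameter into a keyword argument, and those kwargs are forwarded on to the spider's __init__. A minimal sketch, with a plain dict standing in for Flask's request.args:

```python
# Plain dict standing in for Flask's request.args (an ImmutableMultiDict).
query_params = {'myid': '1'}

def start_crawler_stub(*args, **kwargs):
    # kwargs now holds the query parameters from the URL.
    return kwargs

forwarded = start_crawler_stub(**query_params)
print(forwarded)  # {'myid': '1'} -- note the value is a string
```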

And finally, update the start_crawler function in arachne to look like this:

from flask import current_app  # needed for the debug logging below

def start_crawler(spider_loc, flask_app_config, spider_scrapy_settings, *args, **kwargs):
    spider = load_object(spider_loc)
    settings = get_spider_settings(flask_app_config, spider_scrapy_settings)
    current_app.logger.debug(kwargs)

    if SCRAPY_VERSION <= (1, 0, 0):
        start_logger(flask_app_config['DEBUG'])
        crawler = create_crawler_object(spider(*args, **kwargs), settings)
        crawler.start()

    else:
        spider.custom_settings = settings
        flask_app_config['CRAWLER_PROCESS'].crawl(spider, *args, **kwargs)

This will pass all of your URL parameters to the spider class's __init__ method.

So if your run-spider URL looked like http://localhost:8080/run-spider/cg-spider?myid=1

you can now read the myid parameter in your spider class with kwargs.get('myid'):


class CraigslistSpider(scrapy.Spider):
    """
    Tickets spider for NYC
    """
    ...

    def __init__(self, *args, **kwargs):
        super(CraigslistSpider, self).__init__(*args, **kwargs)
        # Query-string values arrive as strings, so format with %s
        # (or convert with int() first if you need a number).
        self.start_urls = ['http://newyork.craigslist.org/search/tia#list/%s' % kwargs.get('myid')]

    def parse(self, response):
        ...
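One caveat worth spelling out: Flask delivers query-string values as strings, so numeric formatting like %d will raise a TypeError unless you convert first. A small sketch:

```python
kwargs = {'myid': '1'}             # query-string values always arrive as str
myid = int(kwargs.get('myid', 0))  # convert before any numeric formatting
url = 'http://newyork.craigslist.org/search/tia#list/%d' % myid
print(url)
```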

Hope that helps!

dalalRohit commented 4 years ago

I did all the steps above, but I'm getting this error: NameError: name 'current_app' is not defined. Can you please help me solve this?