Open Strahivan opened 7 years ago
Hi @Strahivan,
This sounds like a great idea. I want to include this in the next version of arachne. For now, here's how you can get around it.
First, update your spider class and add an __init__ method to it like so (I'll explain below):
class CraigslistSpider(scrapy.Spider):
    """
    Tickets spider for NYC
    """
    ...

    def __init__(self, *args, **kwargs):
        super(CraigslistSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://newyork.craigslist.org/search/tia#list/%s' % kwargs.get('your_url_param')]

    def parse(self, response):
        ...
Next, update the run_spider_endpoint function in arachne to look like this:
def run_spider_endpoint(spider_name):
    """Search for the spider_name in the SPIDER_SETTINGS dict and
    start running the spider with the Scrapy API

    .. version 0.4.0:
        endpoint returns the `status` as `running` and a way to go back to `home` endpoint
    """
    for item in app.config['SPIDER_SETTINGS']:
        if spider_name in item['endpoint']:
            spider_loc = '%s.%s' % (item['location'], item['spider'])
            # Forward the query-string parameters to the crawler as keyword arguments
            start_crawler(spider_loc, app.config, item.get('scrapy_settings'), **request.args)
            return jsonify(home=request.url_root, status='running', spider_name=spider_name)
    return abort(404)
Here you are just passing the request.args from Flask to the crawler.
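To make the hand-off concrete, here is a rough, stdlib-only sketch of how a query string ends up as keyword arguments. It does not use Flask or arachne: `query_to_kwargs` and `fake_start_crawler` are hypothetical helpers, and the spider location string is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def query_to_kwargs(url):
    """Hypothetical helper: turn a URL's query string into a plain dict.

    Stands in for unpacking Flask's request.args. If a key repeats,
    the last value wins in this sketch.
    """
    return dict(parse_qsl(urlsplit(url).query))


def fake_start_crawler(spider_loc, **kwargs):
    """Hypothetical stand-in for start_crawler: reports what it received."""
    return spider_loc, kwargs


kwargs = query_to_kwargs("http://localhost:8080/run-spider/cg-spider?myid=1")
# 'spiders.CraigslistSpider' is an assumed location, not taken from arachne
print(fake_start_crawler("spiders.CraigslistSpider", **kwargs))
```

Note that every value arrives as a string, so `myid` here is `'1'`, not the integer `1`.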
And finally, update the start_crawler function in arachne to look like this:
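# current_app is used for the debug log below, so it has to be imported
from flask import current_app

def start_crawler(spider_loc, flask_app_config, spider_scrapy_settings, *args, **kwargs):
    spider = load_object(spider_loc)
    settings = get_spider_settings(flask_app_config, spider_scrapy_settings)
    current_app.logger.debug(kwargs)
    if SCRAPY_VERSION <= (1, 0, 0):
        # Older Scrapy: instantiate the spider here, passing the parameters along
        start_logger(flask_app_config['DEBUG'])
        crawler = create_crawler_object(spider(*args, **kwargs), settings)
        crawler.start()
    else:
        # Newer Scrapy: hand over the spider class and let crawl() instantiate it
        spider.custom_settings = settings
        flask_app_config['CRAWLER_PROCESS'].crawl(spider, *args, **kwargs)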
This will pass all your URL parameters to the spider class __init__ method.
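As a simplified illustration of that last hop, the sketch below shows keyword arguments travelling from a crawl call into a spider's constructor. `FakeCrawlerProcess` and `EchoSpider` are stand-ins invented for this example so it runs without Scrapy. The stand-in only mimics the `crawl(spidercls, *args, **kwargs)` call shape and, unlike a real crawler process, returns the spider so it can be inspected.

```python
class FakeCrawlerProcess:
    """Stand-in that mimics the call shape crawl(spidercls, *args, **kwargs)."""

    def __init__(self):
        self.spiders = []

    def crawl(self, spider_cls, *args, **kwargs):
        # Instantiate the spider class with whatever was forwarded
        spider = spider_cls(*args, **kwargs)
        self.spiders.append(spider)
        return spider


class EchoSpider:
    """Hypothetical spider that just records the keyword arguments it got."""

    def __init__(self, *args, **kwargs):
        self.received = kwargs


def start_crawler(spider_cls, process, *args, **kwargs):
    """Cut-down version of the forwarding step; settings handling is omitted."""
    return process.crawl(spider_cls, *args, **kwargs)


process = FakeCrawlerProcess()
spider = start_crawler(EchoSpider, process, myid='1')
print(spider.received)
```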
So if your run-spider URL looks like http://localhost:8080/run-spider/cg-spider?myid=1, you can now read the myid parameter in your spider class with kwargs.get('myid'):
class CraigslistSpider(scrapy.Spider):
    """
    Tickets spider for NYC
    """
    ...

    def __init__(self, *args, **kwargs):
        super(CraigslistSpider, self).__init__(*args, **kwargs)
        # URL parameters arrive as strings, so format with %s rather than %d
        self.start_urls = ['http://newyork.craigslist.org/search/tia#list/%s' % kwargs.get('myid')]

    def parse(self, response):
        ...
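Since query-string values arrive as strings and may be missing altogether, it can help to validate them before building the start URL. A minimal sketch, assuming a stand-in base class (`FakeSpider`, made up here so the example runs without Scrapy) and assuming you want a missing or non-numeric myid to fail loudly:

```python
class FakeSpider:
    """Stand-in for scrapy.Spider so this sketch runs without Scrapy."""

    def __init__(self, *args, **kwargs):
        self.__dict__.update(kwargs)


class CraigslistSpider(FakeSpider):
    base_url = 'http://newyork.craigslist.org/search/tia#list/%d'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        raw = kwargs.get('myid')
        if raw is None:
            raise ValueError("missing 'myid' parameter")
        # int() raises ValueError for non-numeric input such as 'abc'
        self.start_urls = [self.base_url % int(raw)]


spider = CraigslistSpider(myid='1')
print(spider.start_urls)
```

Whether to raise or fall back to a default URL is a design choice. Raising keeps a bad request from silently crawling the wrong page.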
Hope that helps
I followed all the steps above, but I'm getting this error: NameError: name 'current_app' is not defined. Can you please help me solve this?
It would be nice if I could parse dynamic endpoints (in SPIDER_SETTINGS) like: 'endpoint': 'crawl/'