TeamHG-Memex / scrapy-crawl-once

Scrapy middleware that allows crawling only new content
MIT License

URLs in start_urls are not affected #1

Open bezkos opened 7 years ago

bezkos commented 7 years ago

I have a spider that crawls only detail pages, and those requests are never skipped by this middleware.

kmike commented 7 years ago

Good catch; we need to add a process_start_requests method as well.
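
A rough sketch of what that could look like. process_start_requests is the standard Scrapy spider-middleware hook, but the attribute name self.default and the exact behaviour below are my assumptions, not the shipped implementation:

    # Hypothetical addition to CrawlOnceMiddleware.
    # Scrapy calls process_start_requests on spider middlewares for the
    # requests produced by start_requests/start_urls, which the existing
    # request/response hooks never get to tag.
    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            # Keep any explicit per-request value; otherwise apply the
            # middleware-wide default (self.default is an assumed name).
            request.meta.setdefault('crawl_once', self.default)
            yield request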

Verz1Lka commented 6 years ago

@bezkos Are you using meta={'crawl_once': True}? I tested the middleware with this simple spider, and it works correctly.
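
(For reference, the middleware was enabled in settings.py roughly as the README shows; adjust the priorities to your project:)

    # settings.py - scrapy-crawl-once hooks in as both a spider middleware
    # and a downloader middleware.
    SPIDER_MIDDLEWARES = {
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    }
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
    }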

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def start_requests(self):
        # Tag each start request explicitly, since requests generated
        # from start_urls do not get crawl_once in meta by default.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'crawl_once': True})

    def parse(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
        }

First run - request sent.

{'crawl_once/initial': 0,
 'crawl_once/stored': 1,
 'downloader/request_bytes': 231,
 'downloader/request_count': 1}

Second run - request ignored.

{'crawl_once/ignored': 1,
 'crawl_once/initial': 1,
 'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1}

Note: requests generated from start_urls do not have crawl_once in their meta dictionary by default. To add it, override the start_requests method.
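
If you would rather not override start_requests, the README also documents a CRAWL_ONCE_DEFAULT setting (False by default) that changes the default value of the crawl_once meta key; a sketch, assuming it covers start requests in your version:

    # settings.py - treat every request as crawl_once unless a request
    # explicitly opts out with meta={'crawl_once': False}.
    CRAWL_ONCE_DEFAULT = True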

Can you explain what problem you had?