bezkos opened this issue 7 years ago
A good catch; we need to add a process_start_requests method as well.
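For context, process_start_requests is the Scrapy spider-middleware hook that sees start requests before they are scheduled; requests yielded from start_requests never pass through process_spider_output, which is why the flag was not being applied to them. A minimal sketch of the suggested fix, assuming the middleware should default the flag for start requests the same way it does for callback-yielded ones (illustration only, not the actual scrapy-crawl-once code; self.default is assumed to hold the configured default value):

class CrawlOnceMiddleware:
    def process_start_requests(self, start_requests, spider):
        # Mirror the defaulting done for callback-yielded requests:
        # set 'crawl_once' unless the spider set it explicitly.
        for request in start_requests:
            request.meta.setdefault('crawl_once', self.default)
            yield request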
@bezkos Are you using meta={'crawl_once': True}?
I tested the middleware with this simple spider, and it works correctly:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'crawl_once': True})

    def parse(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
        }
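(For the stats below to show up, the middleware has to be enabled. As far as I remember from the scrapy-crawl-once README, it is registered both as a spider middleware and as a downloader middleware; the exact priority numbers here are assumptions:)

# settings.py -- enable scrapy-crawl-once on both sides
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}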
First run - request sent.
{'crawl_once/initial': 0,
'crawl_once/stored': 1,
'downloader/request_bytes': 231,
'downloader/request_count': 1}
Second run - request ignored.
{'crawl_once/ignored': 1,
'crawl_once/initial': 1,
'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1}
Note: requests generated from start_urls do not have crawl_once in their meta dictionary by default. To add it, override the start_requests method, as in the spider above.
Can you explain what problem you had?
I have a spider that crawls only detail pages, and they are never skipped by this middleware.
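If it helps narrow this down: as noted above, the middleware only acts on requests that carry crawl_once in their meta, so detail-page requests must be tagged at the point where they are created. A hedged sketch of what that could look like (the callback name parse_detail and the CSS selectors are hypothetical):

def parse(self, response):
    # Hypothetical listing page: follow each detail link and tag it so
    # the middleware can skip it on later runs.
    for href in response.css('a.detail::attr(href)').extract():
        yield response.follow(href, callback=self.parse_detail,
                              meta={'crawl_once': True})

def parse_detail(self, response):
    # Hypothetical extraction; adjust to the real page structure.
    yield {'title': response.css('h1::text').extract_first()}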