scrapy-crawl-once

.. image:: https://img.shields.io/pypi/v/scrapy-crawl-once.svg
   :target: https://pypi.python.org/pypi/scrapy-crawl-once
   :alt: PyPI Version

.. image:: https://travis-ci.org/TeamHG-Memex/scrapy-crawl-once.svg?branch=master
   :target: http://travis-ci.org/TeamHG-Memex/scrapy-crawl-once
   :alt: Build Status

.. image:: http://codecov.io/github/TeamHG-Memex/scrapy-crawl-once/coverage.svg?branch=master
   :target: http://codecov.io/github/TeamHG-Memex/scrapy-crawl-once?branch=master
   :alt: Code Coverage

This package provides a Scrapy_ middleware that avoids re-crawling pages which were already downloaded in previous crawls.

.. _Scrapy: https://scrapy.org/

License is MIT.

Installation

::

    pip install scrapy-crawl-once

Usage

To enable it, modify your settings.py::

    SPIDER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 100,
        # ...
    }

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'scrapy_crawl_once.CrawlOnceMiddleware': 50,
        # ...
    }

By default the middleware does nothing. To avoid crawling a particular page multiple times, set ``request.meta['crawl_once'] = True``. When a response is received and the callback succeeds, the fingerprint of that request is stored in a database. When the spider schedules a new request, the middleware first checks whether its fingerprint is in the database and drops the request if it is.
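
The store-then-check flow described above can be illustrated without Scrapy. The following is only a simplified sketch using the standard library, not the package's actual implementation: the ``fingerprint`` function, the ``SeenRequests`` class, and the table layout are hypothetical names chosen for illustration, and the real middleware relies on Scrapy's own request fingerprinting.

```python
import hashlib
import sqlite3


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that identify it (hypothetical, simplified)."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(b"\n")
    h.update(url.encode("utf-8"))
    h.update(b"\n")
    h.update(body)
    return h.hexdigest()


class SeenRequests:
    """Toy fingerprint store backed by SQLite (in memory by default)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)")

    def should_drop(self, fp: str, meta: dict) -> bool:
        # Requests that did not opt in via meta['crawl_once'] are never filtered.
        if not meta.get("crawl_once"):
            return False
        row = self.db.execute("SELECT 1 FROM seen WHERE fp = ?", (fp,)).fetchone()
        return row is not None

    def mark_crawled(self, fp: str, meta: dict) -> None:
        # Meant to be called only after the callback finished without errors.
        if meta.get("crawl_once"):
            self.db.execute("INSERT OR IGNORE INTO seen (fp) VALUES (?)", (fp,))
            self.db.commit()


store = SeenRequests()
meta = {"crawl_once": True}
fp = fingerprint("GET", "https://example.com/page")

first_attempt_dropped = store.should_drop(fp, meta)   # nothing stored yet
store.mark_crawled(fp, meta)                          # callback succeeded
second_attempt_dropped = store.should_drop(fp, meta)  # fingerprint is now known
unmarked_dropped = store.should_drop(fp, {})          # opt-in flag missing
```

In this sketch the first attempt is let through, the second one is dropped, and a request without the ``crawl_once`` flag is never dropped, which mirrors the opt-in behaviour described above. Passing a file path instead of ``":memory:"`` would make the toy store persist between runs.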

Other request.meta keys:

Settings

Alternatives

https://github.com/scrapy-plugins/scrapy-deltafetch is a similar package; it does almost the same thing. Differences:

Another alternative is the built-in `Scrapy HTTP cache`_. Differences:

.. _Scrapy HTTP cache: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache

Contributing

To run tests, install tox_ and run tox from the source checkout.

.. _tox: https://tox.readthedocs.io/en/latest/


.. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg
   :target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=scrapy-crawl-once
   :alt: define hyperiongray