Neutrino3316 / rss_spider

A spider to grab and store data from RSS sources.

RSS Spider

Chinese | English

CHANGELOG

RSS Spider is a repo that aims to fetch and save data from known RSS sources. (It does NOT fetch data from ordinary websites and generate RSS feeds from them.)

The reasons for this repo are:

  1. RSS provides formatted data, which makes it easy to fetch and save.
  2. RSS sources generate new data continuously, and I want to store this data in my own database.
  3. If I crawled every website myself, I would have to write a different spider for each one. To reduce that workload, I can use the RSS feeds generated by RSSHub, which already contains code for fetching data from many different websites.
  4. By setting up a Docker container, I can grab data around the clock, and the container is convenient to manage.

This repo is still a work in progress. Suggestions, issues, and pull requests are warmly welcome.

Project Structure

File Structure

Configuration

It is highly recommended to use Docker to run this project.

config.yml

An example config.yml looks like this:

mongodb:
    link: mongodb://localhost:27017
rsshub:
    host: http://localhost:1200/
rss:
    zhihu_hotlist:
        link: https://rsshub.app/zhihu/hotlist
        key_list:
            - title
            - link
            - published
            - author
            - summary

For the time being, MongoDB is the only supported database.
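To illustrate how a key_list such as the one in the example might be applied, here is a minimal Python sketch. It assumes the config has already been loaded into a dictionary and that each feed entry is available as a dictionary. The names `CONFIG` and `select_fields` are hypothetical and are not taken from this repo's code, and the sample entry is made up.

```python
from typing import Any, Dict, List, Optional

# Hypothetical in-memory equivalent of the example config.yml.
CONFIG: Dict[str, Any] = {
    "mongodb": {"link": "mongodb://localhost:27017"},
    "rsshub": {"host": "http://localhost:1200/"},
    "rss": {
        "zhihu_hotlist": {
            "link": "https://rsshub.app/zhihu/hotlist",
            "key_list": ["title", "link", "published", "author", "summary"],
        }
    },
}


def select_fields(entry: Dict[str, Any], key_list: List[str]) -> Dict[str, Optional[Any]]:
    """Keep only the keys named in key_list; keys missing from the entry become None."""
    return {key: entry.get(key) for key in key_list}


if __name__ == "__main__":
    # Made-up entry for illustration; a real entry would come from a parsed feed.
    sample_entry = {
        "title": "Example title",
        "link": "https://example.com/item/1",
        "published": "Mon, 01 Jan 2024 00:00:00 GMT",
        "guid": "not-in-key-list",
    }
    keys = CONFIG["rss"]["zhihu_hotlist"]["key_list"]
    print(select_fields(sample_entry, keys))
```

Mapping missing keys to None rather than raising is one possible choice: it keeps stored documents uniform even when a feed omits a field such as author.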

Run your own RSSHub (optional)

It's highly recommended to run your own RSSHub. On the one hand, it lets feeds refresh more often; on the other hand, it reduces the load on the server owned by the RSSHub author.

You can deploy your own RSSHub by following the official RSSHub online documentation.

I use the following command to run RSSHub in Docker. It reduces the cache time so that feeds refresh more quickly.

docker run -d --name rsshub_diygod --restart=always -p 1200:1200 \
    -e CACHE_EXPIRE=5 -e CACHE_CONTENT_EXPIRE=60 \
    diygod/rsshub:latest
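With a self-hosted instance running, feed links that point at the public rsshub.app server can be redirected to it. The following is a minimal sketch of one way to do that. It assumes the rsshub host value from config.yml is meant as a replacement base URL, which this README does not confirm, and `use_self_hosted` is a hypothetical helper rather than part of this repo.

```python
from urllib.parse import urlsplit, urlunsplit


def use_self_hosted(link: str, host: str, public_netloc: str = "rsshub.app") -> str:
    """Point a public RSSHub link at a self-hosted instance.

    Links whose host is not public_netloc are returned unchanged.
    """
    parts = urlsplit(link)
    if parts.netloc != public_netloc:
        return link
    target = urlsplit(host)
    # Keep the route (path, query, fragment); swap only the scheme and host.
    return urlunsplit((target.scheme, target.netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(use_self_hosted("https://rsshub.app/zhihu/hotlist", "http://localhost:1200/"))
```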

Run your own MongoDB in Docker (optional)

docker run -d -p 27017:27017 --name mongo_rss_spider --restart=always \
    -v mongo_rss_spider_data_configdb:/data/configdb \
    -v mongo_rss_spider_data_db:/data/db \
    -v d:/docker_mount/mongo_rss_spider_backup:/mongo_backup \
    mongo
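Because a spider that polls continuously will see the same entries many times, one option is to upsert each entry by a unique field instead of inserting it blindly. The sketch below only builds the filter and update documents. Treating link as the unique key is an assumption, `build_upsert` is a hypothetical helper, and the database and collection names in the comment are made up. None of this is taken from the repo's code.

```python
from typing import Any, Dict, Tuple


def build_upsert(doc: Dict[str, Any], unique_key: str = "link") -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (filter, update) documents for an upsert keyed on unique_key."""
    if doc.get(unique_key) is None:
        raise ValueError(f"document has no value for {unique_key!r}")
    return {unique_key: doc[unique_key]}, {"$set": doc}


if __name__ == "__main__":
    filter_doc, update_doc = build_upsert(
        {"title": "Example title", "link": "https://example.com/item/1"}
    )
    print(filter_doc, update_doc)
    # With pymongo installed and MongoDB running, the pair could be used as:
    #   from pymongo import MongoClient
    #   collection = MongoClient("mongodb://localhost:27017")["rss"]["zhihu_hotlist"]
    #   collection.update_one(filter_doc, update_doc, upsert=True)
```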

Acknowledgement

Many thanks to the RSSHub project, which is written by DIYgod.