accwill / spider

good spider

Design: How to Design a better spider??? #3

Open accwill opened 3 years ago

accwill commented 3 years ago

I want to design this well. I will definitely use it often going forward, and an Electron version is bound to follow.

In the early stage, model it on Python's Scrapy architecture.

The concrete plan is as follows:

ask — issues requests

spider — parses the response data

engine — the core; dispatches tasks

scheduler — schedules requests

pipeline — receives the data the spider has parsed; data storage can happen here

engine -> spider -> scheduler -> ask -> spider -> pipeline -> data

  1. Pause and resume
  2. Observe detailed information about every url
  3. Define a lifecycle for each url
  4. Maximum number of connections
  5. Retry count on failure
  6. more...
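The five components and the feature list above can be sketched as a few TypeScript interfaces plus a minimal scheduler supporting priority and pause/resume. All names (`Request`, `Scheduler`, etc.) are hypothetical illustrations, not the actual accwill/spider API:

```typescript
interface Request {
  url: string;
  priority?: number;   // higher runs first
  retries?: number;    // failed attempts so far (for the retry-count feature)
}

interface Spider {
  // Parse a response body; return scraped items and follow-up requests.
  parse(req: Request, body: string): { items: unknown[]; requests: Request[] };
}

interface Pipeline {
  // Persist or post-process items emitted by a Spider.
  process(item: unknown): void;
}

// Scheduler: orders pending requests by priority and supports pause/resume.
class Scheduler {
  private queue: Request[] = [];
  paused = false;

  enqueue(req: Request): void {
    this.queue.push(req);
    // Keep the highest-priority request at the front.
    this.queue.sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
  }

  next(): Request | undefined {
    return this.paused ? undefined : this.queue.shift();
  }

  get size(): number {
    return this.queue.length;
  }
}

const sched = new Scheduler();
sched.enqueue({ url: "https://example.com/a", priority: 1 });
sched.enqueue({ url: "https://example.com/b", priority: 5 });
sched.paused = true;
const whilePaused = sched.next();   // undefined: scheduler is paused
sched.paused = false;
const first = sched.next();         // the priority-5 request
```

A max-connections limit would sit in the engine, which simply stops calling `next()` while the number of in-flight requests is at the cap.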
accwill commented 3 years ago
  1. Communication: how does the engine coordinate ask, spider, scheduler, and pipeline?
  2. How to track each url from creation to destruction?
  3. How to pause and resume?
  4. How to reach the maximum request volume?
  5. How to schedule by priority?
  6. How to crawl and analyze an entire site from a single url, filtering out redundant data?
  7. How to untangle nested relationships a -> b -> c -> ...: fetching url(a) yields JSON containing url(b); parsing b yields another JSON containing c. How should the nesting levels and data relationships of a, b, and c be stored?
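One way to answer questions 2 and 7 together is to give every request an id, a parent id, and a depth, so the a -> b -> c chain can be reconstructed for any url. A minimal sketch, with hypothetical names:

```typescript
interface TrackedRequest {
  id: number;
  url: string;
  parentId: number | null; // null for seed requests
  depth: number;           // 0 for seeds, parent.depth + 1 otherwise
}

let nextId = 0;
const registry = new Map<number, TrackedRequest>();

// Register a url, optionally as a child of the request that produced it.
function track(url: string, parent?: TrackedRequest): TrackedRequest {
  const req: TrackedRequest = {
    id: nextId++,
    url,
    parentId: parent ? parent.id : null,
    depth: parent ? parent.depth + 1 : 0,
  };
  registry.set(req.id, req);
  return req;
}

// Walk parent links to reconstruct the a -> b -> c chain for any request.
function lineage(req: TrackedRequest): string[] {
  const chain: string[] = [];
  let cur: TrackedRequest | undefined = req;
  while (cur) {
    chain.unshift(cur.url);
    cur = cur.parentId === null ? undefined : registry.get(cur.parentId);
  }
  return chain;
}

const a = track("https://example.com/a");
const b = track("https://example.com/b", a);
const c = track("https://example.com/c", b);
```

The same registry entry is a natural place to hang per-url lifecycle state (queued, in-flight, done, failed) for question 2.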
accwill commented 3 years ago

I found two Scrapy diagrams online; take a look.

  1. The Engine gets the initial Requests to crawl from the Spider.

  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.

  3. The Scheduler returns the next Requests to the Engine.

  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).

  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).

  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).

  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).

  8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.

  9. The process repeats (from step 1) until there are no more requests from the Scheduler.

See the link below for reference.
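Stripped of middlewares, the nine Scrapy steps above collapse into one loop: take the next request from the scheduler, download it, let the spider parse the response, send items to the pipeline and new requests back to the scheduler, until the queue is empty. A simplified sketch with a fake downloader (all names are hypothetical, not Scrapy's actual API):

```typescript
type Req = { url: string };
type Parsed = { items: string[]; requests: Req[] };

function runEngine(
  seeds: Req[],
  download: (r: Req) => string,
  parse: (body: string) => Parsed,
  pipeline: (item: string) => void,
): void {
  const queue: Req[] = [...seeds];              // steps 1-2: schedule seeds
  const seen = new Set(seeds.map((r) => r.url));
  while (queue.length > 0) {                    // step 9: until no requests remain
    const req = queue.shift()!;                 // step 3: next request
    const body = download(req);                 // steps 4-5: download the page
    const { items, requests } = parse(body);    // steps 6-7: spider parses
    items.forEach(pipeline);                    // step 8: items to pipeline
    for (const r of requests) {                 // step 8: requests back to scheduler
      if (!seen.has(r.url)) {
        seen.add(r.url);
        queue.push(r);
      }
    }
  }
}

// Fake two-page site: page "a" links to "b"; "b" links nowhere.
const stored: string[] = [];
runEngine(
  [{ url: "a" }],
  (r) => r.url,
  (body) =>
    body === "a"
      ? { items: ["item-from-a"], requests: [{ url: "b" }] }
      : { items: ["item-from-b"], requests: [] },
  (item) => stored.push(item),
);
```

The `seen` set is a crude stand-in for Scrapy's duplicate filter; a real engine would also run downloads concurrently instead of this sequential loop.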
accwill commented 3 years ago

I've been thinking about how custom Spiders and ItemPipelines should be passed in.

The current idea: pass in all Spiders and ItemPipelines when the Engine is initialized; when registering a request url, specify which Spider handles that url; once the spider has finished parsing the data, it can decide which ItemPipeline processes it.

That's how it will be handled for now.
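The registration idea above can be sketched in a few lines: the Engine takes name-keyed maps of spiders and pipelines at construction, each request names its spider, and each parsed item names its pipeline. All identifiers here are hypothetical illustrations:

```typescript
type Item = { pipeline: string; data: unknown };
type SpiderFn = (body: string) => Item[];
type PipelineFn = (data: unknown) => void;

class Engine {
  constructor(
    private spiders: Record<string, SpiderFn>,
    private pipelines: Record<string, PipelineFn>,
  ) {}

  // `spiderName` is attached to the request, picking which parser
  // handles this url's response.
  handle(spiderName: string, body: string): void {
    const items = this.spiders[spiderName](body);
    // Each item names the pipeline that should store it.
    for (const it of items) this.pipelines[it.pipeline](it.data);
  }
}

const saved: unknown[] = [];
const engine = new Engine(
  { news: (body) => [{ pipeline: "db", data: body.toUpperCase() }] },
  { db: (data) => saved.push(data) },
);
engine.handle("news", "hello");
```

Keying by name rather than passing function references keeps the routing serializable, which matters if requests are ever persisted or sent across an Electron IPC boundary.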

accwill commented 3 years ago

The basic functionality is implemented. No Middlewares have been added yet, since the project hasn't needed them so far; IP (proxy) and HTTP/2 support is also missing. These will be fleshed out gradually as they are needed.