accwill opened 3 years ago
I found two Scrapy architecture diagrams online; take a look.
1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.
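The steps above can be sketched as a small crawl loop in TypeScript. This is only an illustration of the data flow, not the project's actual code; all names (`Engine`, `Scheduler`, `Spider`, `Downloader`) and their signatures are hypothetical, and the middleware hooks are omitted.

```typescript
// Minimal sketch of the Scrapy-style data flow. All names are hypothetical.

interface Request { url: string; }
interface Response { url: string; body: string; }
interface Item { [key: string]: unknown; }

interface Spider {
  startRequests(): Request[];                                   // step 1
  parse(res: Response): { items: Item[]; requests: Request[] }; // step 7
}

class Scheduler {
  private queue: Request[] = [];
  enqueue(req: Request) { this.queue.push(req); }               // step 2
  next(): Request | undefined { return this.queue.shift(); }    // step 3
}

type Downloader = (req: Request) => Promise<Response>;          // steps 4-5

class Engine {
  constructor(
    private spider: Spider,
    private scheduler: Scheduler,
    private download: Downloader,
    private pipeline: (item: Item) => void,                     // step 8
  ) {}

  async crawl(): Promise<void> {
    for (const req of this.spider.startRequests()) this.scheduler.enqueue(req);
    let req: Request | undefined;
    // Step 9: repeat until the scheduler has no more requests.
    while ((req = this.scheduler.next()) !== undefined) {
      const res = await this.download(req);                     // steps 4-5
      const { items, requests } = this.spider.parse(res);       // steps 6-7
      items.forEach(this.pipeline);                             // step 8
      requests.forEach(r => this.scheduler.enqueue(r));         // step 8
    }
  }
}
```

In a real framework the Downloader Middlewares would wrap `download` and the Spider Middleware would wrap `parse`, which is where hooks like process_request() and process_spider_output() fit.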
The current plan is to pass all Spiders and ItemPipelines into the Engine at initialization. When a request URL is specified, you also specify which Spider handles it, and once the spider has parsed the data it can decide which ItemPipeline processes it.
That is how it is handled for now.
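One way the registration and routing described above could look, as a hedged TypeScript sketch: spiders and pipelines are handed to the engine at construction, each request is routed by spider name, and each parsed item names its pipeline. Every identifier here is a placeholder, not the project's actual API.

```typescript
// Hypothetical sketch: engine holds registered spiders and pipelines;
// the caller picks the spider, the spider picks the pipeline.

interface ParsedItem { pipeline: string; data: unknown; }

interface Spider {
  name: string;
  parse(body: string): ParsedItem[];
}

type Pipeline = (data: unknown) => void;

class Engine {
  private spiders = new Map<string, Spider>();
  private pipelines = new Map<string, Pipeline>();

  constructor(spiders: Spider[], pipelines: Record<string, Pipeline>) {
    spiders.forEach(s => this.spiders.set(s.name, s));
    for (const [name, p] of Object.entries(pipelines)) this.pipelines.set(name, p);
  }

  // Caller decides which spider handles this response body.
  handle(spiderName: string, body: string): void {
    const spider = this.spiders.get(spiderName);
    if (!spider) throw new Error(`unknown spider: ${spiderName}`);
    for (const item of spider.parse(body)) {
      // The spider decided which pipeline processes each item.
      const pipe = this.pipelines.get(item.pipeline);
      if (!pipe) throw new Error(`unknown pipeline: ${item.pipeline}`);
      pipe(item.data);
    }
  }
}
```

Keeping the pipeline choice on the item (rather than hard-wiring it in the engine) is what lets one spider feed different stores per item type.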
The basic functionality is implemented. None of the Middlewares have been added yet, since the project does not need them so far, and neither proxy IPs nor HTTP/2 are supported. I will flesh these out as the need arises.
I want to design this well: it will certainly see frequent use later, and an Electron version will definitely follow.
The early version follows the architecture of Python's Scrapy.
The concrete idea is as follows:
ask: performs the requests
spider: parses the data
engine: the core, dispatches tasks
scheduler: schedules the requests
pipeline: receives the data the spider has parsed; data storage can be done here
engine -> spider -> scheduler -> ask -> spider -> pipeline -> data
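The five roles and the flow above can be wired together in a toy end-to-end pass. This uses the author's role names (ask, spider, scheduler, pipeline, engine), but every type and function body here is an illustrative assumption, with a fake `ask` instead of real network I/O.

```typescript
// Toy pass over: engine -> spider -> scheduler -> ask -> spider -> pipeline -> data
// All names and shapes are illustrative only.

type Req = { url: string };
type Res = { url: string; body: string };

const scheduler: Req[] = [];                       // scheduler: a plain FIFO queue
const ask = async (req: Req): Promise<Res> =>      // ask: fake request layer
  ({ url: req.url, body: `page:${req.url}` });
const spider = {                                   // spider: turns a response into data
  start: (): Req[] => [{ url: "a" }],
  parse: (res: Res) => ({ data: [res.body], follow: [] as Req[] }),
};
const stored: string[] = [];
const pipeline = (d: string) => stored.push(d);    // pipeline: store the parsed data

// engine: drives the flow until the scheduler runs dry
async function engine(): Promise<string[]> {
  spider.start().forEach(r => scheduler.push(r));
  let req: Req | undefined;
  while ((req = scheduler.shift()) !== undefined) {
    const res = await ask(req);                    // ask performs the request
    const { data, follow } = spider.parse(res);    // spider parses the response
    data.forEach(pipeline);                        // pipeline stores the data
    follow.forEach(r => scheduler.push(r));        // new requests back to the scheduler
  }
  return stored;
}
```

Even at this size, the shape makes the later middleware work obvious: request middlewares wrap `ask`, spider middlewares wrap `parse`.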