Boris-code / feapder

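🚀🚀🚀feapder is an easy-to-use, powerful Python crawler framework. It has four built-in spiders (AirSpider, Spider, TaskSpider, BatchSpider) for different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The crawler management system feaplat provides convenient deployment and scheduling for it.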
http://feapder.com

Priority issue with yield feapder.Request #205

Closed tisoz closed 1 year ago

tisoz commented 1 year ago

My plan is to have three tasks a, b, and c run on a single thread and complete in the following order:

a1 -> download middleware
a2 -> download middleware
a3 -> download middleware
    b1 -> download middleware
    b2 -> download middleware
    b3 -> download middleware
        c1 -> download middleware
        c2 -> download middleware
        c3 -> download middleware

But no matter how I adjust things, the result I get is always:

a1    b1    c1
                    a2    b2    c2
                                        a3    b3    c3

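If instead of three tasks I have thousands or tens of thousands that need to recurse, the delay between two consecutive requests of the same task grows to several hours.

I wrote a minimal example. Its output is as follows: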

E:\program\py38\python.exe "F:\onedrive\OneDrive - TSCN\桌面(1)\feapder爬虫\Amazon\feapder_test.py" 
2023-03-21 11:36:07.691 | INFO     | feapder.core.scheduler:<lambda>:111 - 
********** feapder begin **********
2023-03-21 11:36:07.872 | INFO     | __main__:start_requests:36 - task us
2023-03-21 11:36:07.872 | INFO     | __main__:start_requests:36 - task jp
2023-03-21 11:36:07.872 | INFO     | __main__:start_requests:36 - task tr
2023-03-21 11:36:07.873 | INFO     | __main__:start_requests:36 - task es
2023-03-21 11:36:07.873 | INFO     | __main__:start_requests:36 - task fd
2023-03-21 11:36:07.873 | INFO     | __main__:start_requests:36 - task tg
2023-03-21 11:36:11.191 | INFO     | __main__:parse_valid_token:45 - tg111 | priority:940000
2023-03-21 11:36:11.191 | INFO     | __main__:parse_valid_token:45 - fd111 | priority:950000
2023-03-21 11:36:11.192 | INFO     | __main__:parse_valid_token:45 - es111 | priority:960000
2023-03-21 11:36:11.192 | INFO     | __main__:parse_valid_token:45 - tr111 | priority:970000
2023-03-21 11:36:11.192 | INFO     | __main__:parse_valid_token:45 - jp111 | priority:980000
2023-03-21 11:36:11.192 | INFO     | __main__:parse_valid_token:45 - us111 | priority:990000
2023-03-21 11:36:14.218 | INFO     | __main__:parse_csrf_token:54 - tg222 | priority:840000
2023-03-21 11:36:14.218 | INFO     | __main__:parse_csrf_token:54 - fd222 | priority:850000
2023-03-21 11:36:14.218 | INFO     | __main__:parse_csrf_token:54 - es222 | priority:860000
2023-03-21 11:36:14.218 | INFO     | __main__:parse_csrf_token:54 - tr222 | priority:870000
2023-03-21 11:36:14.218 | INFO     | __main__:parse_csrf_token:54 - jp222 | priority:880000
2023-03-21 11:36:14.218 | INFO     | __main__:parse_csrf_token:54 - us222 | priority:890000
2023-03-21 11:36:17.228 | INFO     | __main__:parse:64 - tg333 | priority:740000
2023-03-21 11:36:17.228 | INFO     | __main__:parse:64 - fd333 | priority:750000
2023-03-21 11:36:17.228 | INFO     | __main__:parse:64 - es333 | priority:760000
2023-03-21 11:36:17.228 | INFO     | __main__:parse:64 - tr333 | priority:770000
2023-03-21 11:36:17.228 | INFO     | __main__:parse:64 - jp333 | priority:780000
2023-03-21 11:36:17.228 | INFO     | __main__:parse:64 - us333 | priority:790000
2023-03-21 11:36:20.960 | INFO     | feapder.core.scheduler:<lambda>:116 - 
********** feapder end **********
2023-03-21 11:36:21.023 | INFO     | feapder.core.scheduler:spider_end:518 - 《amazon_temp:amazon_address_ck》爬虫结束,耗时 14秒
2023-03-21 11:36:21.206 | INFO     | feapder.core.scheduler:delete_tables:442 - 正在删除key amazon_temp:amazon_address_ck:z_spider_status

Process finished with exit code 0

The code for the minimal example is as follows:

import feapder
import feapder.utils.tools
from feapder.utils.log import log

class AMAZON_ASIN_test(feapder.Spider):
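    # Custom settings; can be removed if the project has a setting.py file
    __custom_setting__ = dict(
        # Framework log level
        LOG_LEVEL="INFO",
        LOG_COLOR=True,  # whether to colorize log output
        LOG_IS_WRITE_TO_CONSOLE=True,  # whether to print logs to the console
        MONGO_DB="Amazon_spider",  # name of the database to save to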
        SPIDER_MAX_RETRY_TIMES=3,
    )

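    def init_base(self, save_table_name=None, item_list=None):
        # Use None instead of a mutable default so instances do not share one list
        self.item_list = item_list if item_list is not None else []
        self.save_table_name = save_table_name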

    def download_midware(self, request):
        return request, {}

    def start_requests(self):
        country_list = {
            "us": "locationType=LOCATION_INPUT&zipCode=90001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "jp": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "tr": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "es": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "fd": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "tg": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
        }
        priority = 1000000
        for i in country_list:
            priority -= 10000
            log.info(f"task {i}")
            yield feapder.Request(
                url=f"https://www.amazon.com/{i}",
                priority=priority,
                country=i,
                auto_request=False,
                callback=self.parse_valid_token)

    def parse_valid_token(self, request, response):
        log.info(f"{request.country}111 | priority:{request.priority}")
        yield feapder.Request(
            url=f"https://www.amazon.com/a{request.country}",
            priority=request.priority - 100000,
            auto_request=False,
            country=request.country,
            callback=self.parse_csrf_token)

    def parse_csrf_token(self, request, response):
        log.info(f"{request.country}222 | priority:{request.priority}")

        yield feapder.Request(
            url=f"https://www.amazon.com/b{request.country}",
            priority=request.priority - 100000,
            auto_request=False,
            country=request.country,
            callback=self.parse)

    def parse(self, request, response):
        log.info(f"{request.country}333 | priority:{request.priority}")

if __name__ == "__main__":
    save_table_name = "amazon_address_ck"
    amazon = AMAZON_ASIN_test(redis_key=f"amazon_temp:{save_table_name}", delete_keys=True, thread_count=1)
    amazon.init_base(save_table_name=f"amazon:{save_table_name}")
    amazon.start()

tisoz commented 1 year ago

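Solved. After switching to the lightweight model with class AMAZON_ASIN_test(feapder.AirSpider):, priorities work as expected. The distributed model does not behave this way.

Log after switching: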

E:\program\py38\python.exe "F:\onedrive\OneDrive - TSCN\桌面(1)\feapder爬虫\Amazon\feapder_test.py" 
2023-03-21 13:38:05.076 | INFO     | __main__:start_requests:36 - task us
2023-03-21 13:38:05.076 | INFO     | __main__:start_requests:36 - task jp
2023-03-21 13:38:05.076 | INFO     | __main__:start_requests:36 - task tr
2023-03-21 13:38:05.076 | INFO     | __main__:start_requests:36 - task es
2023-03-21 13:38:05.076 | INFO     | __main__:start_requests:36 - task fd
2023-03-21 13:38:05.076 | INFO     | __main__:start_requests:36 - task tg
2023-03-21 13:38:06.079 | INFO     | __main__:parse_valid_token:45 - us111 | priority:3
2023-03-21 13:38:06.079 | INFO     | __main__:parse_csrf_token:54 - us222 | priority:2
2023-03-21 13:38:06.079 | INFO     | __main__:parse:64 - us333 | priority:1
2023-03-21 13:38:06.079 | INFO     | __main__:parse_valid_token:45 - tr111 | priority:3
2023-03-21 13:38:06.079 | INFO     | __main__:parse_csrf_token:54 - tr222 | priority:2
2023-03-21 13:38:06.079 | INFO     | __main__:parse:64 - tr333 | priority:1
2023-03-21 13:38:06.079 | INFO     | __main__:parse_valid_token:45 - jp111 | priority:3
2023-03-21 13:38:06.080 | INFO     | __main__:parse_csrf_token:54 - jp222 | priority:2
2023-03-21 13:38:06.080 | INFO     | __main__:parse:64 - jp333 | priority:1
2023-03-21 13:38:06.080 | INFO     | __main__:parse_valid_token:45 - fd111 | priority:3
2023-03-21 13:38:06.080 | INFO     | __main__:parse_csrf_token:54 - fd222 | priority:2
2023-03-21 13:38:06.080 | INFO     | __main__:parse:64 - fd333 | priority:1
2023-03-21 13:38:06.080 | INFO     | __main__:parse_valid_token:45 - es111 | priority:3
2023-03-21 13:38:06.080 | INFO     | __main__:parse_csrf_token:54 - es222 | priority:2
2023-03-21 13:38:06.080 | INFO     | __main__:parse:64 - es333 | priority:1
2023-03-21 13:38:06.080 | INFO     | __main__:parse_valid_token:45 - tg111 | priority:3
2023-03-21 13:38:06.080 | INFO     | __main__:parse_csrf_token:54 - tg222 | priority:2
2023-03-21 13:38:06.080 | INFO     | __main__:parse:64 - tg333 | priority:1
2023-03-21 13:38:10.098 | INFO     | feapder.core.spiders.air_spider:run:104 - 无任务,爬虫结束

Process finished with exit code 0

Boris-code commented 1 year ago

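The distributed spider fetches a batch of tasks into memory and then consumes them. Each batch is fetched in priority order. Your task count is probably too small, so the first batch only picked up a1 b1 c1, and a2 b2 c2 were not picked up until the second batch.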
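
To make the batching effect concrete, here is a small standalone simulation. It is only a sketch and does not use feapder or reproduce its internals. The helpers `child_of` and `run` and the parameter `batch_size` are hypothetical names invented for illustration. It assumes that a smaller priority number is served first, which matches the logs above (tg at 940000 runs before us at 990000).

```python
import heapq


def child_of(task):
    """Return the follow-up task for (priority, name, stage), or None after stage 3."""
    priority, name, stage = task
    if stage >= 3:
        return None
    # Same step as in the issue's code: the child gets priority - 100000
    return (priority - 100000, name, stage + 1)


def run(start_tasks, batch_size):
    """Consume a shared priority queue in batches and return the processing order."""
    queue = list(start_tasks)
    heapq.heapify(queue)  # min-heap: smallest priority number is popped first
    order = []
    while queue:
        # Pull up to batch_size tasks into memory before processing any of them
        batch = [heapq.heappop(queue) for _ in range(min(batch_size, len(queue)))]
        for task in batch:
            order.append(f"{task[1]}{task[2]}")
            child = child_of(task)
            if child is not None:
                # New requests go back to the shared queue and are only
                # seen when the next batch is fetched
                heapq.heappush(queue, child)
    return order


start = [(990000, "a", 1), (980000, "b", 1), (970000, "c", 1)]

batched = run(start, batch_size=3)     # whole first level fetched at once
one_by_one = run(start, batch_size=1)  # every pop sees the newest child

print(batched)
print(one_by_one)
```

With a batch that covers all pending tasks, the order is level by level (c1 b1 a1, then c2 b2 a2, then c3 b3 a3). With one task per fetch, each chain finishes before the next one starts, which is the order the AirSpider log shows.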

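The choice between AirSpider and Spider depends on whether your task volume needs a distributed setup. For millions or tens of millions of tasks, Spider is the better fit.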