Open ganlnyn0000 opened 2 weeks ago
I'm using the latest version and replaced the Cookie in test.yaml. Here is my test code:

```python
async with DouyinCrawler(TestConfigManager.get_test_config("douyin")) as crawler:
    sec_uid = "MS4wLjABAAAARHbYBn84JChECWkdFOJ0r8t6jaxCS6VNSCGl4SpP0pE"
    params = UserProfile(
        sec_user_id=sec_uid,
    )
    response = await crawler.fetch_user_profile(params)
    assert response, "Failed to fetch user profile"
    print(f"aweme_count: {response.get('user').get('aweme_count')}")
    max_cursor = 0
    aweme_count = 0
    p = 1
    while True:
        params = UserPost(
            max_cursor=max_cursor,
            count=10,
            sec_user_id=sec_uid,
        )
        response = await crawler.fetch_user_post(params)
        assert response, "Failed to fetch user post"
        print(f"page {p}: aweme_list count: {len(response.get('aweme_list'))}, has_more: {response.get('has_more')}")
        p = p + 1
        aweme_count = aweme_count + len(response.get('aweme_list'))
        video = UserPostFilter(response)
        #video_id = video.aweme_id
        #print(video_id)
        if response.get('has_more') == 0 or len(response.get('aweme_list')) == 0:
            break
        max_cursor = response['max_cursor']
    print(f"fetch_user_post aweme_count: {aweme_count}")
```
Here is the printed output (the detailed results are omitted because there is too much data):

```
aweme_count: 241
page 1: aweme_list count: 9, has_more: 1
page 2: aweme_list count: 10, has_more: 1
page 3: aweme_list count: 2, has_more: 1
page 4: aweme_list count: 0, has_more: 1
fetch_user_post aweme_count: 21
```
This sec_uid has 241 videos, but by page 4 aweme_list is already `[]`, and only 21 videos were fetched in total. I tested 10 sec_uids: 5 could be fetched completely, and 5 only partially. It doesn't seem related to the number of videos either — some accounts with several thousand videos can be fetched completely. @Johnserf-Seed could you take a look at what the problem is?
Whether the collection is complete is determined by the `has_more` field in the response. The `max_cursor` parameter that drives pagination is a timestamp, and a page can come back empty simply because the author published nothing in that time window. So you should stop only when `has_more` is `0`, not when `aweme_list` happens to be empty. Hope that answers your question. @ganlnyn0000
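To illustrate the point, here is a minimal, self-contained sketch of the two stop conditions. The `fetch_user_post` function below is a hypothetical stand-in for the real `crawler.fetch_user_post` call (the page data is fabricated to mimic the output above, with an empty page 4 representing a time window with no posts); it is not the library's API.

```python
# Fabricated pages mimicking the reported output; page 4 is empty but has_more is
# still 1, because the corresponding time window contains no posts.
PAGES = [
    {"aweme_list": list(range(9)),  "has_more": 1, "max_cursor": 100},
    {"aweme_list": list(range(10)), "has_more": 1, "max_cursor": 200},
    {"aweme_list": list(range(2)),  "has_more": 1, "max_cursor": 300},
    {"aweme_list": [],              "has_more": 1, "max_cursor": 400},  # empty window
    {"aweme_list": list(range(5)),  "has_more": 0, "max_cursor": 500},
]

def fetch_user_post(max_cursor):
    """Hypothetical stand-in for the real API call; looks up a page by cursor."""
    cursors = [0, 100, 200, 300, 400]
    return PAGES[cursors.index(max_cursor)]

def collect_all(stop_on_empty_page):
    """Page through all posts; optionally also stop on an empty page (the bug)."""
    total, max_cursor = 0, 0
    while True:
        resp = fetch_user_post(max_cursor)
        total += len(resp["aweme_list"])
        if resp["has_more"] == 0:
            break  # the only correct termination condition
        if stop_on_empty_page and not resp["aweme_list"]:
            break  # the extra condition from the original loop: stops too early
        max_cursor = resp["max_cursor"]
    return total

print(collect_all(stop_on_empty_page=True))   # 21 — bails out at the empty page
print(collect_all(stop_on_empty_page=False))  # 26 — keeps paging until has_more == 0
```

With the extra empty-page check the loop loses every post after the first empty window, which matches the "some accounts fetch fully, some only partially" symptom: it depends on whether the account's posting history contains such a gap.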
@Johnserf-Seed Thanks a lot! I'll give it another try.