NanmiCoder / MediaCrawler

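Xiaohongshu note | comment crawler, Douyin video | comment crawler, Kuaishou video | comment crawler, Bilibili video | comment crawler, Weibo post | comment crawler, Baidu Tieba post | Baidu Tieba comment-reply crawler, Zhihu Q&A article | comment crawler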
https://nanmicoder.github.io/MediaCrawler/

Error when crawling a specified Weibo account's information #504

Closed JianxunRao closed 1 day ago

JianxunRao commented 2 days ago

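It looks like risk control was triggered? This was my first time crawling Weibo: I fetched 800 records from a single account and was blocked right away? Is there any way to deal with this, for example by lowering the crawl frequency? Thanks!

The full log is below: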

(venv) trojx@MBP16  ~/PycharmProjects/MediaCrawler   main ±  python main.py --platform wb --type creator --get_comment true --get_sub_comment false --save_data_option csv
/Users/trojx/PycharmProjects/MediaCrawler/venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2024-11-28 17:47:57 MediaCrawler INFO (proxy_ip_pool.py:59) - [ProxyIpPool._is_valid_proxy] testing 114.106.137.104 is it valid
2024-11-28 17:47:58 MediaCrawler INFO (proxy_ip_pool.py:71) - [ProxyIpPool._is_valid_proxy] testing 114.106.137.104 err: 503 Service Unavailable
2024-11-28 17:47:59 MediaCrawler INFO (proxy_ip_pool.py:59) - [ProxyIpPool._is_valid_proxy] testing 36.151.192.236 is it valid
2024-11-28 17:48:02 MediaCrawler INFO (core.py:313) - [WeiboCrawler.launch_browser] Begin create browser context ...
2024-11-28 17:48:04 MediaCrawler INFO (core.py:276) - [WeiboCrawler.create_weibo_client] Begin create weibo API client ...
2024-11-28 17:48:04 MediaCrawler INFO (client.py:89) - [WeiboClient.pong] Begin pong weibo...
2024-11-28 17:48:05 MediaCrawler ERROR (client.py:97) - [WeiboClient.pong] cookie may be invalid and again login...
2024-11-28 17:48:05 MediaCrawler INFO (login.py:48) - [WeiboLogin.begin] Begin login weibo ...
2024-11-28 17:48:05 MediaCrawler INFO (login.py:78) - [WeiboLogin.login_by_qrcode] Begin login weibo by qrcode ...
2024-11-28 17:48:06 MediaCrawler INFO (crawler_util.py:42) - [find_login_qrcode] get qrcode by _url:https://v2.qr.weibo.cn/inf/gen?api_key=a0241ed0d922ea76&data=https%3A%2F%2Fpassport.weibo.cn%2Fsignin%2Fqrcode%2Fscan%3Fqr%3D3ZGRnSDxWABWMgX7mExibMvvmvlTmlnfcBnFyY29kZQ..%26sinainternalbrowser%3Dtopnav%26showmenu%3D0&datetime=1732787286&deadline=0&level=M&logo=https%3A%2F%2Fimg.t.sinajs.cn%2Ft6%2Fstyle%2Fimages%2Findex%2Fweibo-logo.png&output_type=img&redirect=0&sign=b18c35d27bc8&size=180&start_time=0&title=sso&type=url_
2024-11-28 17:48:06 MediaCrawler INFO (login.py:94) - [WeiboLogin.login_by_qrcode] Waiting for scan code login, remaining time is 20s
2024-11-28 17:48:35 MediaCrawler INFO (login.py:108) - [WeiboLogin.login_by_qrcode] Login successful then wait for 5 seconds redirect ...
2024-11-28 17:48:40 MediaCrawler INFO (core.py:86) - [WeiboCrawler.start] redirect weibo mobile homepage and update cookies on mobile platform
2024-11-28 17:48:43 MediaCrawler INFO (core.py:246) - [WeiboCrawler.get_creators_and_notes] Begin get weibo creators
2024-11-28 17:48:44 MediaCrawler INFO (client.py:290) - [WeiboClient.get_creator_info_by_id] get container_info : {'fid_container_id': '1005051892723783', 'lfid_container_id': '102803'}
2024-11-28 17:48:45 MediaCrawler ERROR (client.py:67) - [WeiboClient.request] request GET:https://m.weibo.cn/api/container/getIndex?jumpfrom=weibocom&type=uid&value=1892723783&containerid=1005051892723783 err, res:{'ok': -100, 'errno': '-100', 'msg': '', 'url': 'https://m.weibo.cn/api/geetest?testType=1&backUrl=https%3A%2F%2Fm.weibo.cn', 'extra': ''}
Traceback (most recent call last):
  File "/Users/trojx/PycharmProjects/MediaCrawler/main.py", line 66, in <module>
    asyncio.get_event_loop().run_until_complete(main())
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/Users/trojx/PycharmProjects/MediaCrawler/main.py", line 56, in main
    await crawler.start()
  File "/Users/trojx/PycharmProjects/MediaCrawler/media_platform/weibo/core.py", line 100, in start
    await self.get_creators_and_notes()
  File "/Users/trojx/PycharmProjects/MediaCrawler/media_platform/weibo/core.py", line 248, in get_creators_and_notes
    createor_info_res: Dict = await self.wb_client.get_creator_info_by_id(creator_id=user_id)
  File "/Users/trojx/PycharmProjects/MediaCrawler/media_platform/weibo/client.py", line 301, in get_creator_info_by_id
    user_res = await self.get(uri, params)
  File "/Users/trojx/PycharmProjects/MediaCrawler/media_platform/weibo/client.py", line 80, in get
    return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers, **kwargs)
  File "/Users/trojx/PycharmProjects/MediaCrawler/media_platform/weibo/client.py", line 68, in request
    raise DataFetchError(data.get("msg", "unkonw error"))
media_platform.weibo.exception.DataFetchError

JianxunRao commented 2 days ago

Code version:

commit ca9b47ef63548d74a6e4487e6fcf458ab0b85342 (HEAD -> main, origin/main, origin/HEAD)
Author: Relakkes <relakkes@gmail.com>
Date: Wed Nov 27 09:41:24 2024 +0800

    fix: xhs 帖子详情优化 (optimize xhs post details)

NanmiCoder commented 1 day ago

Weibo's risk control is generally IP-based. Try switching to a different IP and running the crawl again.
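
The original question also asked about lowering the crawl frequency. Below is a minimal, generic sketch of spacing out requests with a random delay. The names `crawl_with_delay`, `polite_sleep`, and `fetch`, as well as the delay bounds, are hypothetical and not part of MediaCrawler. Slower requests may not help if the block is tied to the IP, as noted above.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Iterable, List


async def polite_sleep(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep for a random duration in [min_s, max_s] and return the delay used."""
    delay = random.uniform(min_s, max_s)
    await asyncio.sleep(delay)
    return delay


async def crawl_with_delay(
    items: Iterable[Any],
    fetch: Callable[[Any], Awaitable[Any]],  # hypothetical per-item fetch coroutine
    min_s: float = 1.0,
    max_s: float = 3.0,
) -> List[Any]:
    """Fetch items one at a time, pausing a random interval after each request."""
    results = []
    for item in items:
        results.append(await fetch(item))
        await polite_sleep(min_s, max_s)
    return results
```

Fetching sequentially with jitter trades throughput for a less regular request pattern. Suitable bounds depend on the target site and are not known from this thread.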

JianxunRao commented 1 day ago

Weibo's risk control is generally IP-based. Try switching to a different IP and running the crawl again.

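Thanks. I found that after switching the proxy IP and manually opening the geetest CAPTCHA link printed in the log to complete the verification, the crawl could continue.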
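
In the log above, the blocked response has `'ok': -100` and a `url` pointing at a geetest page, while `msg` is an empty string, which is why the traceback ends in a bare `DataFetchError` with no message. A small sketch of detecting that response shape and surfacing the verification link follows. The helper name `extract_verification_url` is hypothetical, and the check is inferred from this single logged response rather than from any documented Weibo API.

```python
from typing import Optional


def extract_verification_url(res: dict) -> Optional[str]:
    """Return the verification URL if the response looks like a geetest challenge.

    Assumption: a challenge is signalled by ok == -100 plus a 'url' field
    containing 'geetest', as seen in the single response logged in this issue.
    """
    url = str(res.get("url", ""))
    if res.get("ok") == -100 and "geetest" in url:
        return url
    return None


# The response dict copied from the log in this issue.
blocked = {
    "ok": -100,
    "errno": "-100",
    "msg": "",
    "url": "https://m.weibo.cn/api/geetest?testType=1&backUrl=https%3A%2F%2Fm.weibo.cn",
    "extra": "",
}

link = extract_verification_url(blocked)
if link is not None:
    print(f"Verification required, open this link in a browser: {link}")
```

A caller could use such a check to print a clearer message before raising, instead of relying on the empty `msg` field.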