NanmiCoder / MediaCrawler

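Crawlers for Xiaohongshu notes and comments, Douyin videos and comments, Kuaishou videos and comments, Bilibili videos and comments, Weibo posts and comments, Baidu Tieba posts and comment replies, and Zhihu Q&A articles and comments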
https://nanmicoder.github.io/MediaCrawler/
Other
17.27k stars 5.43k forks source link

Why does crawling 170 notes from the same creator fail? The note details are crawled, but the comments are not. The error output is below. #450

Open 707806032 opened 2 weeks ago

707806032 commented 2 weeks ago

2024-10-08 21:50:36 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 64a966e8000000000f00e70c
2024-10-08 21:50:36 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 64a55f9d00000000310091ba
2024-10-08 21:50:40 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 64964f3a000000000800dfa8
2024-10-08 21:50:40 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 6495029b00000000140273e5
2024-10-08 21:50:40 asyncio WARNING (proactor_events.py:353) - socket.send() raised exception.
2024-10-08 21:50:40 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 64925f880000000013001496
2024-10-08 21:50:40 asyncio WARNING (proactor_events.py:353) - socket.send() raised exception.
2024-10-08 21:50:44 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 648fcd3b0000000013014332
2024-10-08 21:50:44 MediaCrawler INFO (core.py:257) - [XiaoHongShuCrawler.get_comments] Begin get note id comments 648b2b7e000000000703a035
Traceback (most recent call last):
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\anyio\_core\_tasks.py", line 115, in fail_after
    yield cancel_scope
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_backends\anyio.py", line 114, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\anyio\_core\_sockets.py", line 219, in connect_tcp
    await event.wait()
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\anyio\_backends\_asyncio.py", line 1662, in wait
    await self._event.wait()
  File "C:\Users\HUAWEI\AppData\Local\Programs\Python\Python312\Lib\asyncio\locks.py", line 212, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 1c4591ead50

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_exceptions.py", line 10, in map_exceptions
    yield
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_backends\anyio.py", line 113, in connect_tcp
    with anyio.fail_after(timeout):
  File "C:\Users\HUAWEI\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\anyio\_core\_tasks.py", line 118, in fail_after
    raise TimeoutError
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_transports\default.py", line 60, in map_httpcore_exceptions
    yield
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_transports\default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_async\connection_pool.py", line 262, in handle_async_request
    raise exc
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_async\connection_pool.py", line 245, in handle_async_request
    response = await connection.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_async\connection.py", line 92, in handle_async_request
    raise exc
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_async\connection.py", line 69, in handle_async_request
    stream = await self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_async\connection.py", line 117, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_backends\auto.py", line 31, in connect_tcp
    return await self._backend.connect_tcp(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_backends\anyio.py", line 112, in connect_tcp
    with map_exceptions(exc_map):
  File "C:\Users\HUAWEI\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpcore\_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectTimeout

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\tenacity\_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\client.py", line 86, in request
    response = await client.request(
               ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_client.py", line 1617, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_client.py", line 1645, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_client.py", line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_transports\default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "C:\Users\HUAWEI\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\httpx\_transports\default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectTimeout

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\main.py", line 58, in <module>
    asyncio.get_event_loop().run_until_complete(main())
  File "C:\Users\HUAWEI\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\main.py", line 47, in main
    await crawler.start()
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\core.py", line 84, in start
    await self.get_creators_and_notes()
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\core.py", line 160, in get_creators_and_notes
    await self.batch_get_note_comments(note_ids)
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\core.py", line 252, in batch_get_note_comments
    await asyncio.gather(*task_list)
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\core.py", line 258, in get_comments
    await self.xhs_client.get_note_all_comments(
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\client.py", line 288, in get_note_all_comments
    comments_res = await self.get_note_comments(note_id, comments_cursor)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\client.py", line 249, in get_note_comments
    return await self.get(uri, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\media_platform\xhs\client.py", line 116, in get
    return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\tenacity\_asyncio.py", line 88, in async_wrapped
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\tenacity\_asyncio.py", line 47, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HUAWEI\OneDrive\文档\GitHub\MediaCrawler\venv\Lib\site-packages\tenacity\__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1c455019190 state=finished raised ConnectTimeout>]

NanmiCoder commented 1 week ago

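The connection timeout may mean the platform is refusing the connection. Also, could the comments be missing because comment crawling is not enabled?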
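The traceback above shows a single `ConnectTimeout` (wrapped in `tenacity.RetryError`) propagating out of `asyncio.gather` in `batch_get_note_comments` and ending the whole run. If the platform really is refusing connections under load, one thing that might help is to limit concurrency and collect failures per note instead of aborting. The following is only a rough sketch using the standard library: `crawl_comments`, `fetch_comments`, `max_concurrency`, and `pause_seconds` are hypothetical names, not part of MediaCrawler, and nothing here is confirmed as the project's own fix.

```python
import asyncio


async def crawl_comments(note_ids, fetch_comments, max_concurrency=1, pause_seconds=0.0):
    """Fetch comments for each note, isolating failures per note.

    fetch_comments is a hypothetical coroutine function taking a note id;
    in MediaCrawler it would stand in for the per-note comment request.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(note_id):
        # Only max_concurrency requests run at once.
        async with semaphore:
            try:
                return await fetch_comments(note_id)
            finally:
                # Pause while still holding the slot to space requests out.
                await asyncio.sleep(pause_seconds)

    # return_exceptions=True returns raised exceptions as results
    # instead of propagating the first one to the caller.
    results = await asyncio.gather(
        *(fetch_one(note_id) for note_id in note_ids),
        return_exceptions=True,
    )

    succeeded, failed = {}, {}
    for note_id, result in zip(note_ids, results):
        if isinstance(result, BaseException):
            failed[note_id] = result
        else:
            succeeded[note_id] = result
    return succeeded, failed
```

The `failed` mapping can then be logged or retried later, so notes whose comments timed out do not prevent the rest from being saved.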

707806032 commented 1 week ago

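I do have comment crawling enabled. With 50 posts or fewer there is no problem and all the comments are crawled, but once it goes past 100 or so it no longer seems to work.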
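Since runs of up to 50 posts reportedly succeed, one workaround worth trying is to split the note ids into batches of about that size and wait between batches. Whether this avoids the timeouts is an assumption, not something confirmed in this thread. In the sketch below, `chunked`, `crawl_in_batches`, `crawl_batch`, `batch_size`, and `cooldown_seconds` are all hypothetical names chosen for illustration.

```python
import asyncio


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


async def crawl_in_batches(note_ids, crawl_batch, batch_size=50, cooldown_seconds=60.0):
    """Run crawl_batch (a hypothetical coroutine function) on one batch at a time."""
    batches = chunked(list(note_ids), batch_size)
    results = []
    for index, batch in enumerate(batches):
        results.append(await crawl_batch(batch))
        # Wait between batches, but not after the last one.
        if index < len(batches) - 1:
            await asyncio.sleep(cooldown_seconds)
    return results
```

With 170 note ids and the default `batch_size`, this gives three batches of 50 and one of 20.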
