drunkdream / weread-exporter

Export books from WeRead (微信读书) to epub, pdf, mobi, and other formats

Some books hang and raise errors during export #41

Open damoguyanzero opened 11 months ago

damoguyanzero commented 11 months ago

For example, when exporting the book c8832370813ab7fdbg016f39, the export hangs on the copyright page. The program then retries, hangs again, and keeps looping like this. After I stop it with Ctrl+C, it reports the following errors:

D:\Software\weread-exporter-main>python -m weread_exporter -b c8832370813ab7fdbg016f39 -o epub
[2023-08-01 20:07:58,420][INFO]Exporting book c8832370813ab7fdbg016f39
[2023-08-01 20:07:58,577][INFO][WeReadWebPage] Launch url https://weread.qq.com/web/bookDetail/c8832370813ab7fdbg016f39
[2023-08-01 20:07:59,110][INFO]Browser listening on: ws://127.0.0.1:50099/devtools/browser/8328d9f7-f693-45bd-bba7-3f599a42261e

[2023-08-01 20:08:05,902][INFO][WeReadExporter] Check chapter 2/版权信息
[2023-08-01 20:08:05,902][INFO][WeReadExporter] File cache\c8832370813ab7fdbg016f39\chapters\1-2.md not exist
[2023-08-01 20:08:05,902][INFO][WeReadWebPage] Go to chapter 2
[2023-08-01 20:08:05,922][INFO][WeReadWebPage] Fetch url https://weread.qq.com/web/reader/c8832370813ab7fdbg016f39kc81322c012c81e728d9d180
[2023-08-01 20:08:06,209][INFO][WeReadWebPage] Fetch url https://midas.gtimg.cn/midas/minipay_v2/jsapi/cashier.js
[2023-08-01 20:08:06,211][INFO][WeReadWebPage] Fetch url https://weread-1258476243.file.myqcloud.com/web/wrwebnjlogic/css/app.4605d864.css
[2023-08-01 20:08:06,212][INFO][WeReadWebPage] Fetch url https://weread-1258476243.file.myqcloud.com/web/wrwebnjlogic/js/app.27ff86e3.js
[2023-08-01 20:08:35,910][WARNING]Load chapter failed, close browser and retry
[2023-08-01 20:08:35,911][INFO]terminate chrome process...
[2023-08-01 20:08:35,911][ERROR]connection unexpectedly closed
[2023-08-01 20:08:35,911][ERROR]Task exception was never retrieved
future: <Task finished name='Task-275' coro=<Connection._async_send() done, defined at C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py:69> exception=InvalidStateError('invalid state')>
Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\websockets\legacy\protocol.py", line 979, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 73, in _async_send
    await self.connection.send(msg)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\websockets\legacy\protocol.py", line 635, in send
    await self.ensure_open()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\websockets\legacy\protocol.py", line 944, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 79, in _async_send
    await self.dispose()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 170, in dispose
    await self._on_close()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 151, in _on_close
    cb.set_exception(_rewriteError(
asyncio.exceptions.InvalidStateError: invalid state
[2023-08-01 20:08:36,039][INFO][WeReadWebPage] Launch url https://weread.qq.com/web/bookDetail/c8832370813ab7fdbg016f39
[2023-08-01 20:08:36,575][INFO]Browser listening on: ws://127.0.0.1:50160/devtools/browser/5153350a-c749-4f25-9074-d84de4c8869a
[2023-08-01 20:08:42,724][INFO][WeReadExporter] Check chapter 2/版权信息
[2023-08-01 20:08:42,724][INFO][WeReadExporter] File cache\c8832370813ab7fdbg016f39\chapters\1-2.md not exist
[2023-08-01 20:08:42,724][INFO][WeReadWebPage] Go to chapter 2
[2023-08-01 20:08:42,735][INFO][WeReadWebPage] Fetch url https://weread.qq.com/web/reader/c8832370813ab7fdbg016f39kc81322c012c81e728d9d180
[2023-08-01 20:08:42,961][INFO][WeReadWebPage] Fetch url https://midas.gtimg.cn/midas/minipay_v2/jsapi/cashier.js
[2023-08-01 20:08:42,962][INFO][WeReadWebPage] Fetch url https://weread-1258476243.file.myqcloud.com/web/wrwebnjlogic/css/app.4605d864.css
[2023-08-01 20:08:42,963][INFO][WeReadWebPage] Fetch url https://weread-1258476243.file.myqcloud.com/web/wrwebnjlogic/js/app.27ff86e3.js
Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\Software\weread-exporter-main\weread_exporter\__main__.py", line 147, in <module>
    main()
  File "D:\Software\weread-exporter-main\weread_exporter\__main__.py", line 143, in main
    loop.run_until_complete(async_main())
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 629, in run_until_complete
    self.run_forever()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 321, in run_forever
    super().run_forever()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 596, in run_forever
    self._run_once()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 1854, in _run_once
    event_list = self._selector.select(timeout)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 439, in select
    self._poll(timeout)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 788, in _poll
    status = _overlapped.GetQueuedCompletionStatus(self._iocp, ms)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\launcher.py", line 153, in _close_process
    self._loop.run_until_complete(self.killChrome())
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 618, in run_until_complete
    self._check_running()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 578, in _check_running
    raise RuntimeError('This event loop is already running')
RuntimeError: This event loop is already running
[2023-08-01 20:08:54,035][INFO]terminate chrome process...
[2023-08-01 20:08:54,035][ERROR]connection unexpectedly closed
[2023-08-01 20:08:54,035][ERROR]Task exception was never retrieved
future: <Task finished name='Task-544' coro=<Connection._async_send() done, defined at C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py:69> exception=InvalidStateError('invalid state')>
Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 73, in _async_send
    await self.connection.send(msg)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\websockets\legacy\protocol.py", line 635, in send
    await self.ensure_open()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\websockets\legacy\protocol.py", line 944, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 79, in _async_send
    await self.dispose()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 170, in dispose
    await self._on_close()
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 151, in _on_close
    cb.set_exception(_rewriteError(
asyncio.exceptions.InvalidStateError: invalid state
[2023-08-01 20:08:54,136][ERROR]Task exception was never retrieved
future: <Task finished name='Task-4' coro=<Connection._recv_loop() done, defined at C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py:53> exception=UnicodeEncodeError('gbk', '[https://weread.qq.com/web/reader/c8832370813ab7fdbg016f39kc81322c012c81e728d9d180] fillText © 0 881.3333339691162 JSHandle@array\r\n', 93, 94, 'illegal multibyte sequence')>
Traceback (most recent call last):
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 61, in _recv_loop
    await self._on_message(resp)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 143, in _on_message
    self._on_query(msg)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 123, in _on_query
    session._on_message(params.get('message'))
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\connection.py", line 276, in _on_message
    self.emit(obj.get('method'), obj.get('params'))
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyee\_base.py", line 115, in emit
    handled = self._call_handlers(event, args, kwargs)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyee\_base.py", line 98, in _call_handlers
    self._emit_run(f, args, kwargs)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyee\_base.py", line 83, in _emit_run
    f(*args, **kwargs)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\page.py", line 184, in <lambda>
    client.on('Runtime.consoleAPICalled', lambda event: self._onConsoleAPI(event))
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\page.py", line 692, in _onConsoleAPI
    self._addConsoleMessage(event['type'], values)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyppeteer\page.py", line 729, in _addConsoleMessage
    self.emit(Page.Events.Console, message)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyee\_base.py", line 115, in emit
    handled = self._call_handlers(event, args, kwargs)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyee\_base.py", line 98, in _call_handlers
    self._emit_run(f, args, kwargs)
  File "C:\Users\huyuj\AppData\Local\Programs\Python\Python39\lib\site-packages\pyee\_base.py", line 83, in _emit_run
    f(*args, **kwargs)
  File "D:\Software\weread-exporter-main\weread_exporter\webpage.py", line 234, in handle_log
    fp.write("[%s] %s\n" % (self._url, message.text))
UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 93: illegal multibyte sequence
[2023-08-01 20:08:54,182][ERROR]Task was destroyed but it is pending!
task: <Task pending name='Task-179' coro=<WeReadWebPage._handle_request() running at D:\Software\weread-exporter-main\weread_exporter\webpage.py:337> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x0000025681125D30>()]>>
sys:1: RuntimeWarning: coroutine 'Launcher.killChrome' was never awaited
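
The `RuntimeError: This event loop is already running` in the Ctrl+C trace is what asyncio raises when `run_until_complete` is called on a loop that is already running. The trace shows pyppeteer's `_close_process` making that call. The sketch below reproduces only the asyncio behaviour, without pyppeteer; the names `noop` and `main` are illustrative and not taken from the project.

```python
import asyncio


async def noop():
    """Stand-in for a cleanup coroutine such as the killChrome call in the trace."""
    return None


async def main():
    loop = asyncio.get_running_loop()
    coro = noop()
    try:
        # Re-entering a loop that is already running is rejected by asyncio.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


message = asyncio.run(main())
print(message)
```

This error appears to be a side effect of interrupting the process rather than the reason the chapter hangs.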

drunkdream commented 11 months ago

It works fine on my side. Could this be caused by a network problem or something similar?

damoguyanzero commented 11 months ago

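It's probably not a network problem. My other books export successfully, but this one fails. The error output contains the line UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 93: illegal multibyte sequence. Could the © symbol, which is what \xa9 stands for, be what triggers the error?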
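
A minimal sketch of why that character could fail. It assumes the log file written in `handle_log` is opened in text mode without an explicit `encoding`, so that Python falls back to the locale encoding (GBK in this trace). That assumption is not confirmed by the thread. The helper `write_log` and the sample line are hypothetical.

```python
import os
import tempfile

# Hypothetical console message containing the copyright sign (U+00A9).
LINE = "fillText \u00a9 0\n"


def write_log(path, text, encoding):
    """Append text to path with the given encoding; return True on success."""
    try:
        with open(path, "a", encoding=encoding) as fp:
            fp.write(text)
        return True
    except UnicodeEncodeError:
        return False


tmp_dir = tempfile.mkdtemp()
# GBK has no mapping for U+00A9, so this write fails as in the traceback.
gbk_ok = write_log(os.path.join(tmp_dir, "gbk.log"), LINE, "gbk")
# UTF-8 can represent every Unicode character, so this write succeeds.
utf8_ok = write_log(os.path.join(tmp_dir, "utf8.log"), LINE, "utf-8")
print(gbk_ok, utf8_ok)
```

If that assumption holds, passing `encoding="utf-8"` when opening the log file, or setting the environment variable `PYTHONUTF8=1` before running the exporter, may avoid the exception. Whether this also resolves the hang is not established here.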

376924098 commented 3 months ago

Same problem here; it happens intermittently.