fuergaosi233 / gitbook2pdf

Grab the contents of the gitbook document and convert it to pdf

Disconnect occurs after crawling a certain number of pages #75

Open mybu opened 3 years ago

mybu commented 3 years ago

Python version: 3.6.5

Command used:

    python gitbook.py https://wizardforcel.gitbooks.io/python-quant-uqer/content/

Based on the crawl log, I located the relevant code and made one change: I added a random sleep before each request.

    async def gettext(self, index, url, level, title):
        '''
        return path's html
        '''
        # Pause for a random 2 to 7 seconds before each request.
        secRnd = random.randint(2, 7)
        time.sleep(secRnd)
        print("To avoid overloading the server, pausing {}s, crawling : {}".format(secRnd, url))
        try:
            metatext = await request(url, self.headers, timeout=10)
        except Exception as e:
            # On any failure, pause again and retry once with request's default timeout.
            time.sleep(secRnd)
            print("To avoid overloading the server, pausing {}s, recrawling : {}".format(secRnd, url))
            metatext = await request(url, self.headers)
        try:
            text = ChapterParser(metatext, title, level, ).parser()
            print("done : ", url)
            self.content_list[index] = text
        except IndexError:
            print('faild at : ', url, ' maybe content is empty?')
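A side note on the pause itself: `time.sleep` inside a coroutine blocks the whole event loop, so every other task stalls while one is sleeping, whereas `asyncio.sleep` yields control back to the loop. The sketch below shows a non-blocking version of the same random pause. The helper name `polite_pause` and its parameters are hypothetical, not part of gitbook2pdf.

```python
import asyncio
import random


async def polite_pause(min_seconds=2.0, max_seconds=7.0):
    """Sleep for a random interval without blocking the event loop.

    Hypothetical helper: other tasks scheduled on the same loop keep
    running while this coroutine waits.
    """
    delay = random.uniform(min_seconds, max_seconds)
    await asyncio.sleep(delay)  # yields to the event loop, unlike time.sleep
    return delay
```

Inside `gettext`, this would be called as `secRnd = await polite_pause(2, 7)` in place of the `random.randint` and `time.sleep` pair. Whether that alone avoids the disconnects is untested here.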

However, after a certain number of pages have been crawled, the disconnect error still occurs:

    done : https://wizardforcel.gitbooks.io/python-quant-uqer/content/81.html
    Traceback (most recent call last):
      File "gitbook.py", line 5, in <module>
        Gitbook2PDF(url).run()
      File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 202, in run
        loop.run_until_complete(self.crawl_main_content(content_urls))
      File "d:\ProgramData\Anaconda3\envs\python36\lib\asyncio\base_events.py", line 468, in run_until_complete
        return future.result()
      File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 224, in crawl_main_content
        await asyncio.gather(*tasks)
      File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 246, in gettext
        metatext = await request(url, self.headers)
      File "E:\code\pythonCode\thirdparty\gitbook2pdf-master\gitbook2pdf\gitbook2pdf.py", line 21, in request
        async with session.get(url, headers=headers, timeout=timeout) as resp:
      File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\client.py", line 1005, in __aenter__
        self._resp = await self._coro
      File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\client.py", line 497, in _request
        await resp.start(conn)
      File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\client_reqrep.py", line 844, in start
        message, payload = await self._protocol.read()  # type: ignore  # noqa
      File "d:\ProgramData\Anaconda3\envs\python36\lib\site-packages\aiohttp\streams.py", line 588, in read
        await self._waiter
    aiohttp.client_exceptions.ServerDisconnectedError: None
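The traceback shows the exception escaping from the single retry inside `gettext`, with every page launched at once through `asyncio.gather`. One possible mitigation, offered as a sketch rather than a confirmed fix, is to cap the number of requests in flight and retry failures with exponential backoff. The names `fetch_with_retry` and `crawl_all`, and the injected `fetch` coroutine, are hypothetical and not part of gitbook2pdf; `fetch` stands in for whatever performs the actual HTTP request.

```python
import asyncio
import random


async def fetch_with_retry(fetch, url, semaphore, retries=3, base_delay=1.0,
                           retry_on=(Exception,)):
    """Call `await fetch(url)` under a concurrency cap, retrying on failure."""
    for attempt in range(retries + 1):
        try:
            # The semaphore is held only while the request is running,
            # not while this task is waiting to retry.
            async with semaphore:
                return await fetch(url)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # Exponential backoff (1x, 2x, 4x base_delay) plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def crawl_all(fetch, urls, max_concurrency=5, **retry_kwargs):
    """Fetch every URL with at most `max_concurrency` requests in flight.

    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_retry(fetch, url, semaphore, **retry_kwargs)
             for url in urls]
    return await asyncio.gather(*tasks)
```

With aiohttp, passing `retry_on=(aiohttp.ClientError, asyncio.TimeoutError)` would cover `ServerDisconnectedError`, which is a subclass of `aiohttp.ClientError`. The concurrency limit and delay values are guesses that would need tuning against the target server.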