dataabc / weiboSpider

Sina Weibo crawler: scrape Sina Weibo data with Python
8.15k stars 1.95k forks

It runs for a while and then stops: AttributeError: 'NoneType' object has no attribute 'xpath' #494

Closed minshengxinwen closed 1 year ago

minshengxinwen commented 1 year ago

To make the problem easier to solve, please answer the questions below carefully. Once the problem is resolved, please close this issue promptly.

A: github

A: Yes

A: I don't know. I only crawled a single user.

A: See the details below.

A: See below.

"user_id_list": ["5687069307"],
"filter": 1,
"since_date": "1999-01-01",
"end_date": "now",
"random_wait_pages": [1, 5],
"random_wait_seconds": [6, 10],
"global_wait": [[1000, 3600], [500, 2000]],
"write_mode": ["csv", "txt"],
"pic_download": 1,
"video_download": 1,
"file_download_timeout": [5, 5, 10],
"result_dir_name": 0,

A:


So, tell me: does your cat also jump onto your shoulder while you're working, then walk up onto your head... and then, before you've had a chance to get mad, lean in and kiss you on the mouth?
Location: none  Posted at: 2022-04-02 18:40  Source: 微博 weibo.com  Likes: 3668  Reposts: 21  Comments: 623  url: https://weibo.cn/comment/LmBAje0MR


Filtering out reposted weibo
Filtering out reposted weibo
------------------------------Fetched page 124 of ETF拯救世界 (5687069307)'s weibo------------------------------
6 weibo written to the csv file, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\5687069307.csv
6 weibo written to the txt file, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\5687069307.txt
About to download original weibo images
Download progress: 100%|████████████████████████| 6/6 [00:00<00:00, 6.67it/s]
Original weibo images downloaded, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\img
About to download videos
Download progress: 100%|████████████████████████| 6/6 [00:00<?, ?it/s]
Videos downloaded, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\video
Progress:   9%|███████▍ | 124/1378 [21:21<3:00:04, 8.62s/it]
HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /5687069307/profile?page=125 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)))
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
    self._prepare_proxy(conn)
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connectionpool.py", line 932, in _prepare_proxy
    conn.connect()
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connection.py", line 362, in connect
    self.sock = ssl_wrap_socket(
  File "C:\Program Files\Python310\lib\site-packages\urllib3\util\ssl_.py", line 386, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Program Files\Python310\lib\ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\Program Files\Python310\lib\ssl.py", line 1071, in _create
    self.do_handshake()
  File "C:\Program Files\Python310\lib\ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "C:\Program Files\Python310\lib\site-packages\urllib3\util\retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /5687069307/profile?page=125 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo_spider\parser\util.py", line 25, in handle_html
    resp = requests.get(url, headers=headers)
  File "C:\Program Files\Python310\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\adapters.py", line 510, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /5687069307/profile?page=125 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)))
Progress:   9%|███████▍ | 124/1378 [21:26<3:36:46, 10.37s/it]
'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
  File "C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo_spider\spider.py", line 179, in get_weibo_info
    weibos, self.weibo_id_list, to_continue = PageParser(
  File "C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo_spider\parser\page_parser.py", line 45, in __init__
    info = self.selector.xpath("//div[@class='c']")
AttributeError: 'NoneType' object has no attribute 'xpath'
Crawled 1000 original weibo in total
Data scraping finished


PS C:\0-study\python_study\otherpy\weibo\weiboSpider>
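Reading the chained tracebacks above: the proxy resets the connection (WinError 10054), requests turns that into a ProxyError, handle_html in weibo_spider/parser/util.py apparently catches and logs the error and returns None, and PageParser.__init__ then calls .xpath on that None selector. Below is a minimal sketch of a guard, assuming that reading of the flow; fetch_html_with_retry and its parameters are illustrative, not part of weiboSpider:

import time

import requests
from lxml import etree


def fetch_html_with_retry(url, headers, retries=3, backoff=60):
    # Fetch a page, retrying on connection/proxy errors instead of
    # giving up (and returning None) on the first failure.
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return etree.HTML(resp.content)
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt}/{retries} failed: {e}")
            if attempt < retries:
                time.sleep(backoff * attempt)  # back off longer each retry
    return None


# Callers still have to handle a None selector instead of calling .xpath on it:
selector = fetch_html_with_retry("https://weibo.cn/5687069307/profile?page=125",
                                 headers={"User-Agent": "Mozilla/5.0"})
if selector is not None:
    info = selector.xpath("//div[@class='c']")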

minshengxinwen commented 1 year ago

In my personal opinion, ......
Location: none  Posted at: 2022-05-24 10:28  Source: 微博 weibo.com  Likes: 2485  Reposts: 63  Comments: 248  url: https://weibo.cn/comment/LusAisxBv

------------------------------Fetched page 81 of ETF拯救世界 (5687069307)'s weibo------------------------------
9 weibo written to the csv file, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\5687069307.csv
9 weibo written to the txt file, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\5687069307.txt
About to download original weibo images
Download progress: 100%|████████████████████████| 9/9 [00:00<?, ?it/s]
Original weibo images downloaded, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\img
About to download videos
Download progress: 100%|████████████████████████| 9/9 [00:00<?, ?it/s]
Videos downloaded, saved to: C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo\ETF拯救世界\video
Progress:   6%|████▉ | 81/1378 [15:53<4:23:50, 12.21s/it]
HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /5687069307/profile?page=82 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)))
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connectionpool.py", line 667, in urlopen
    self._prepare_proxy(conn)
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connectionpool.py", line 932, in _prepare_proxy
    conn.connect()
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connection.py", line 362, in connect
    self.sock = ssl_wrap_socket(
  File "C:\Program Files\Python310\lib\site-packages\urllib3\util\ssl_.py", line 386, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Program Files\Python310\lib\ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\Program Files\Python310\lib\ssl.py", line 1071, in _create
    self.do_handshake()
  File "C:\Program Files\Python310\lib\ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Program Files\Python310\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "C:\Program Files\Python310\lib\site-packages\urllib3\util\retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /5687069307/profile?page=82 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo_spider\parser\util.py", line 25, in handle_html
    resp = requests.get(url, headers=headers)
  File "C:\Program Files\Python310\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\requests\adapters.py", line 510, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /5687069307/profile?page=82 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)))
'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
  File "C:\0-study\python_study\otherpy\weibo\weiboSpider\weibo_spider\parser\page_parser.py", line 45, in __init__
    info = self.selector.xpath("//div[@class='c']")
AttributeError: 'NoneType' object has no attribute 'xpath'
Crawled 675 original weibo in total
Data scraping finished

The place where the error occurs varies from run to run.

dataabc commented 1 year ago

Thanks for the feedback. This is most likely Weibo temporarily limiting you for crawling too fast; wait a while and try again.

huayueximujun commented 1 year ago

> Thanks for the feedback. This is most likely Weibo temporarily limiting you for crawling too fast; wait a while and try again.

Could you put known-good values for "random_wait_pages" and "random_wait_seconds" into the config or the md docs?

dataabc commented 1 year ago

@huayueximujun I don't know which values work best either, but the larger the numbers, the slower the crawl and the less likely it is to be limited.
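As I read the config (an illustration of the semantics, not maintainer-verified values): "random_wait_pages": [1, 5] with "random_wait_seconds": [6, 10] means the spider waits a random 6-10 seconds after every 1-5 pages. A more conservative fragment would therefore look like:

"random_wait_pages": [1, 2],
"random_wait_seconds": [20, 30],

Narrowing the page interval and raising the wait both reduce the average request rate.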

huayueximujun commented 1 year ago

> @huayueximujun I don't know which values work best either, but the larger the numbers, the slower the crawl and the less likely it is to be limited.

I'm not very experienced with this. Sometimes it feels like the page currently being processed is read too quickly and gets limited. If I want to add a delay for every single weibo that is crawled, which file should I modify?

dataabc commented 1 year ago

@huayueximujun The program emulates the web version of Weibo search. For a given page of results, it either fetches all of them or none of them, so a per-weibo delay isn't needed. If you really want one, you can modify the last method in search.py, which parses the individual weibo, and add your delay there.
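A minimal sketch of the per-weibo delay idea, under the assumption that the sleep goes inside whatever loop parses individual weibo; parse_with_delay and the stand-in parser are hypothetical, not weiboSpider's actual API:

import random
import time


def parse_with_delay(items, parse_one, low=1.0, high=3.0):
    # Parse each item, then sleep a random low..high seconds so the work
    # done per weibo is spread out instead of happening in a burst.
    results = []
    for item in items:
        results.append(parse_one(item))
        time.sleep(random.uniform(low, high))
    return results


# Usage sketch with a stand-in parser:
print(parse_with_delay(["weibo-a", "weibo-b"], str.upper, low=0.1, high=0.2))
# -> ['WEIBO-A', 'WEIBO-B']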

huayueximujun commented 1 year ago

> @huayueximujun The program emulates the web version of Weibo search. For a given page of results, it either fetches all of them or none of them, so a per-weibo delay isn't needed. If you really want one, you can modify the last method in search.py, which parses the individual weibo, and add your delay there.

Thanks for the reply.

ManutdGTA commented 1 year ago

"random_wait_pages": [1, 3], "random_wait_seconds": [20, 30],

That's what I set mine to.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 year ago

Closing as stale, please reopen if you'd like to work on this further.