SpiderClub / weibospider

:zap: A distributed crawler for weibo, built with celery and requests.
MIT License

Keyword search crawl fails #196

Closed xwt0016 closed 4 years ago

xwt0016 commented 4 years ago

Weibo banned my IP, so I added an IP proxy. Login runs successfully, but after running python first_task_execution/search the output is as shown below, and no data appears in weibo_data. I have already changed need_proxy to True in get_page in page_get/basic.py.

[2020-03-07 21:34:12,974: INFO/MainProcess] Received task: tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363]
2020-03-07 21:34:12 - crawler - INFO - We are searching keyword "武汉红十字会"
[2020-03-07 21:34:12,976: INFO/ForkPoolWorker-1] We are searching keyword "武汉红十字会"
2020-03-07 21:34:12 - crawler - INFO - the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-07 21:34:12,979: INFO/ForkPoolWorker-1] the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
2020-03-07 21:37:08 - crawler - WARNING - Excepitons are raised when crawling http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1.Here are details:HTTPConnectionPool(host='183.164.228.73', port=49691): Max retries exceeded with url: http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd699739ba8>: Failed to establish a new connection: [Errno 110] Connection timed out',)))
[2020-03-07 21:37:08,589: WARNING/ForkPoolWorker-1] Excepitons are raised when crawling http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1.Here are details:HTTPConnectionPool(host='183.164.228.73', port=49691): Max retries exceeded with url: http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd699739ba8>: Failed to establish a new connection: [Errno 110] Connection timed out',)))
2020-03-07 21:37:08 - crawler - ERROR - failed to crawl http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1,here are details:an integer is required (got type str), stack is File "/home/xwt/Desktop/weibospider-temp_verification/decorators/decorators.py", line 17, in time_limit
    return func(*args, **kargs)
[2020-03-07 21:37:08,590: ERROR/ForkPoolWorker-1] failed to crawl http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1,here are details:an integer is required (got type str), stack is File "/home/xwt/Desktop/weibospider-temp_verification/decorators/decorators.py", line 17, in time_limit
    return func(*args, **kargs)
2020-03-07 21:37:08 - crawler - WARNING - No search result for keyword 武汉红十字会, the source page is
[2020-03-07 21:37:08,591: WARNING/ForkPoolWorker-1] No search result for keyword 武汉红十字会, the source page is
[2020-03-07 21:37:08,592: INFO/ForkPoolWorker-1] Task tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363] succeeded in 175.61601991499992s: None

Is "Max retries exceeded with url" happening because the proxy IPs expire too quickly?

thekingofcity commented 4 years ago

ProxyError('Cannot connect to proxy.',

Check your proxy.
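The log above shows a TCP-level failure ("Failed to establish a new connection: [Errno 110] Connection timed out") when connecting to the proxy at 183.164.228.73:49691. One way to check a proxy before handing it to the crawler is sketched below. This is a minimal sketch using only the standard library; `proxy_is_reachable` and `filter_live_proxies` are hypothetical helper names, not part of weibospider, and a successful TCP connection only shows that the proxy port is open, not that the proxy will forward requests to s.weibo.com.

```python
import socket


def proxy_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of weibospider.
    """
    try:
        # int(port) lets the port be given as either an int or a numeric string.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def filter_live_proxies(proxies, timeout=5.0):
    """Keep only the (host, port) pairs that currently accept a TCP connection."""
    return [(h, p) for h, p in proxies if proxy_is_reachable(h, p, timeout)]
```

If a proxy that passed this check still fails a few seconds later, that points to short-lived proxies rather than a configuration mistake.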

xwt0016 commented 4 years ago

OK, thanks. Also, is it true that the worker cannot be started as the root user?

thekingofcity commented 4 years ago

It can. Celery will just warn that running as root is not recommended.

xwt0016 commented 4 years ago

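I can obtain proxies, and login works fine through the proxy, so why does search run into proxy problems? If I don't use a proxy in page_get, I still get nothing: it goes straight to "has been crawled". In principle, even without logging in, I should at least be able to get the first page of results.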

2020-03-08 15:07:05 - other - INFO - Login successful! The login account is 17507424089
2020-03-08 15:07:16 - other - INFO - Login successful! The login account is 18574774032
2020-03-08 15:07:42 - crawler - INFO - the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&scope=ori&suball=1&page=1
2020-03-08 15:07:56 - crawler - INFO - keyword 武汉红十字会 has been crawled in this turn
2020-03-08 15:08:16 - other - INFO - Login successful! The login account is qthku6x0@duoduo.cafe
2020-03-08 15:08:37 - other - INFO - Login successful! The login account is tm5vdlac@anjing.cool
2020-03-08 15:08:55 - other - INFO - Login successful! The login account is 73iadkvm@duoduo.cafe

OneCodeMonkey commented 4 years ago

Check whether your accounts can log in normally. Try it manually on your own machine.

xwt0016 commented 4 years ago

They can log in normally. An earlier batch of accounts I bought required phone verification, so I bought a new batch specifically. I wonder whether the cookie for Weibo search is different from the cookie for Weibo itself. I had been using the temp_verification version. Today I could not log in to 超级鹰 (Chaojiying), but 云打码 (Yundama) is working again, so I tried 1.7.2 and got the following errors:

2020-03-09 09:41:32 - other - INFO - Login successful! The login account is 17680282715
[2020-03-09 09:41:32,622: INFO/ForkPoolWorker-1] Login successful! The login account is 17680282715
2020-03-09 09:41:32 - crawler - INFO - the crawling url is https://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-09 09:41:32,897: INFO/ForkPoolWorker-1] the crawling url is https://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1
[2020-03-09 09:41:33,013: WARNING/ForkPoolWorker-1] /home/xwt/miniconda3/envs/py35/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
[2020-03-09 09:41:36,520: INFO/MainProcess] Received task: tasks.login.login_task[b8fe2c55-0064-4305-bd92-fcd086c8f2a8]
2020-03-09 09:41:40 - other - INFO - Login successful! The login account is qthku6x0@duoduo.cafe
[2020-03-09 09:41:40,106: INFO/ForkPoolWorker-2] Login successful! The login account is qthku6x0@duoduo.cafe
[2020-03-09 09:41:48,453: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,454: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,456: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,458: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,460: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,461: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,461: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'get'
[2020-03-09 09:41:48,463: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,463: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'get'
[2020-03-09 09:41:48,464: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,465: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,466: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,467: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,469: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,469: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'img'
[2020-03-09 09:41:48,471: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,474: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,475: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,476: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
[2020-03-09 09:41:48,477: ERROR/ForkPoolWorker-1] 'NoneType' object has no attribute 'find'
2020-03-09 09:41:48 - crawler - INFO - keyword 武汉红十字会 has been crawled in this turn
[2020-03-09 09:41:48,477: INFO/ForkPoolWorker-1] keyword 武汉红十字会 has been crawled in this turn
2020-03-09 09:41:54 - other - INFO - Login successful! The login account is yax9gheb@anjing.cool

OneCodeMonkey commented 4 years ago

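I have run into this. First, if testing the account by hand shows that it needs to be unblocked with a phone number, then even a successful login will not return the search page content. If the account is fine, needs no phone-number unblocking, and login succeeds, yet you still cannot get the search page content, the IP has most likely been restricted. I have seen both cases.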

xwt0016 commented 4 years ago

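I just switched to a different proxy API, one rate-limited to 5 requests per second. It fetched 30 posts and then reported proxy errors again, so I gave up on that route. What I mainly want is the reposts and comments of the weibos found by search. I scraped the search results separately, imported them into weibo_data, and then crawled the comments and reposts, which runs fine. Thanks, everyone, for the replies.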
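For a proxy API limited to 5 requests per second, as mentioned above, a client-side limiter can keep the crawler under the quota. The thread does not establish that the renewed proxy errors came from exceeding the limit, so treat that as an assumption. The sketch below is a minimal sliding-window limiter using only the standard library; `RateLimiter` is a hypothetical class, not part of weibospider or Celery.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any sliding window of `period` seconds.

    Hypothetical helper for illustration; not part of weibospider.
    """

    def __init__(self, max_calls=5, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._stamps = deque()   # timestamps of calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Sleep until the oldest recorded call leaves the window, then re-check.
            self._sleep(self.period - (now - self._stamps[0]))
```

Calling `limiter.wait()` immediately before each request to the proxy API keeps calls within the quota for a single process. With several Celery workers each process would hold its own limiter, so the combined rate could still exceed the limit.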