dataabc / weibo-crawler

A Sina Weibo crawler: crawls Sina Weibo data with Python and downloads Weibo images and videos

Can user_ids that fail to crawl like this be skipped? #96

Open PhoebusSi opened 4 years ago

PhoebusSi commented 4 years ago

    Traceback (most recent call last):
      File "weibo.py", line 628, in get_one_page
        js = self.get_weibo_json(page)
      File "weibo.py", line 125, in get_weibo_json
        js = self.get_json(params)
      File "weibo.py", line 117, in get_json
        return r.json()
      File "/usr/lib/python3/dist-packages/requests/models.py", line 808, in json
        return complexjson.loads(self.text, **kwargs)
      File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    Progress:   1%|          | 10/1210 [00:33<1:44:42,  5.24s/it]

(the identical traceback then repeats)

For some user_ids this message appears over and over: the Progress bar barely advances and a single user takes a very long time. Could a mechanism be added that detects this situation and skips the user_id as soon as it occurs?
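For reference, the exception means r.json() received a body that is not JSON at all (typically an empty or HTML error page returned when Weibo throttles the client). A minimal sketch of catching it and signalling failure instead of crashing; the function name and parameter handling here are illustrative, not weibo.py's actual get_json:

    import requests

    def get_json_or_none(url, params):
        """Return the parsed JSON response, or None if the body is not valid JSON."""
        r = requests.get(url, params=params)
        try:
            return r.json()
        except ValueError:
            # JSONDecodeError is a subclass of ValueError; an empty or HTML
            # body (typical when rate-limited) lands here, so the caller can
            # skip this page or user instead of crashing.
            return None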

dataabc commented 4 years ago

Thanks for the feedback.

It looks like the JSON value is None, probably because you are crawling too fast. Once the error appears, does it keep appearing on the following pages? If so, it has nothing to do with the user_id, and the only fix is to slow down. Specifically, modify the get_pages method:

                    # After every random_pages pages, pause for 6-10 seconds
                    # so Weibo does not throttle the crawler.
                    if (page - page1) % random_pages == 0 and page < page_count:
                        sleep(random.randint(6, 10))
                        page1 = page
                        random_pages = random.randint(1, 5)

The block above waits 6 to 10 seconds after every 1 to 5 pages. You can make the waits more frequent (decrease random_pages) or longer (increase the values passed to sleep).
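As a self-contained illustration of that pacing pattern (the fetch step is a placeholder; the real loop in get_pages does more work):

    import random
    from time import sleep

    def crawl_pages(page_count):
        """Simplified sketch of the pacing logic in get_pages."""
        page1 = 0
        random_pages = random.randint(1, 5)
        for page in range(1, page_count + 1):
            # ... fetch and parse one page here ...
            # Pause after every 1-5 pages; raise the sleep range or lower
            # random_pages' upper bound to slow down further.
            if (page - page1) % random_pages == 0 and page < page_count:
                sleep(random.randint(6, 10))
                page1 = page
                random_pages = random.randint(1, 5)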

If you still have problems, feel free to continue the discussion.

hahajason commented 4 years ago

And if the ERROR does not appear continuously, does that mean the id is invalid? In that case, could the crawler skip that id and go on to crawl the next one?

    2020-07-19 21:01:07,561 - ERROR - weibo.py[:1047] - 'id'
    Traceback (most recent call last):
      File "C:\Users\ymh23\Desktop\weibo-crawler-master\weibo.py", line 1016, in get_pages
        self.print_user_info()
      File "C:\Users\ymh23\Desktop\weibo-crawler-master\weibo.py", line 538, in print_user_info
        logger.info(u'用户id:%s', self.user['id'])
    KeyError: 'id'
    2020-07-19 21:01:07,574 - ERROR - weibo.py[:1099] - 'screen_name'
    Traceback (most recent call last):
      File "C:\Users\ymh23\Desktop\weibo-crawler-master\weibo.py", line 1097, in start
        self.update_user_config_file(self.user_config_file_path)
      File "C:\Users\ymh23\Desktop\weibo-crawler-master\weibo.py", line 980, in update_user_config_file
        info.append(self.user['screen_name'])
    KeyError: 'screen_name'

dataabc commented 4 years ago

@hahajason

If the id is invalid, the program skips it and continues. Your case above is not an invalid id: the crawler was rate-limited for requesting too fast, so no user info was fetched, and the program raised an error when it tried to read data it never received. The limit lifts automatically after a while, and you can modify the code as shown above to slow down.
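To make that failure mode concrete, a hypothetical guard like the one below would turn the KeyError into an explicit skip (safe_print_user_info and its logging are assumptions for illustration, not weibo.py's actual code):

    import logging

    logger = logging.getLogger(__name__)

    def safe_print_user_info(user):
        """Skip gracefully when the user dict is empty, e.g. after being throttled."""
        if not user or 'id' not in user:
            # No data came back (rate-limited or invalid id): skip, don't raise
            logger.warning('no user info fetched, skipping this user')
            return False
        logger.info(u'用户id:%s', user['id'])
        return True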