dataabc / weibo-crawler

A Sina Weibo crawler: uses Python to crawl Sina Weibo data and download Weibo images and videos

Fix: when a post's text content is empty, selector is None, which breaks subsequent parsing #340

Closed BlueHtml closed 1 year ago

BlueHtml commented 1 year ago

When a post's text content is empty ("mblog"."text": " " in the JSON), etree.HTML(text_body) returns None, which causes the subsequent parsing to fail.

Error message

2023-01-26 12:49:59,376 - ERROR - weibo.py[:842] - 'NoneType' object has no attribute 'xpath'
Traceback (most recent call last):
  File "D:\tmp\code\weibo-crawler\weibo.py", line 836, in get_one_weibo
    weibo = self.parse_weibo(weibo_info)
  File "D:\tmp\code\weibo-crawler\weibo.py", line 732, in parse_weibo
    weibo["article_url"] = self.get_article_url(selector)
  File "D:\tmp\code\weibo-crawler\weibo.py", line 633, in get_article_url
    text = selector.xpath("string(.)")
AttributeError: 'NoneType' object has no attribute 'xpath'

Example: the birthday post that is published automatically on a user's birthday has empty content ("mblog"."text": " " in the JSON): [image]

The raw JSON returned contains "mblog"."text": " ", as shown in the screenshot: [image]

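Fix: append <hr> to the end of the empty string. The result is a valid HTML string, so it is parsed correctly and an HTML element object is returned, which the subsequent code can use normally. Because <hr> is a self-closing horizontal rule with no text content, it does not affect normal data parsing.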
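
The idea can be sketched as follows. This is a minimal illustration, not the exact change merged in this PR: the helper name make_parseable is hypothetical, and treating a zero-length string the same as a whitespace-only one is an assumption, since the reported case is a single space. The lxml demonstration at the bottom is optional and is skipped if lxml is not installed.

```python
def make_parseable(text_body: str) -> str:
    """Hypothetical helper: make a blank post body safe to pass to etree.HTML().

    A body that is empty or whitespace-only gives the HTML parser no element
    to build a tree from, so <hr> is appended. Any other body is returned
    unchanged.
    """
    # str.strip() yields "" for both "" and whitespace-only strings.
    # Handling "" as well as " " is an assumption of this sketch.
    if not text_body.strip():
        return text_body + "<hr>"
    return text_body


if __name__ == "__main__":
    # Optional demonstration with lxml (third-party); skipped if unavailable.
    try:
        from lxml import etree
    except ImportError:
        etree = None

    if etree is not None:
        # With <hr> appended the parser has an element to build a tree from,
        # so selector is an element and .xpath() can be called on it.
        selector = etree.HTML(make_parseable(" "))
        print(repr(selector.xpath("string(.)")))
```

Because <hr> carries no text, the string(.) query used in get_article_url sees no extra characters from the appended tag.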

dataabc commented 1 year ago

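Thank you for contributing the code and for the very detailed explanation. This was an oversight on my part; I had not considered this case. Thanks again, merged.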