dataabc / weiboSpider

Sina Weibo crawler: scrapes Sina Weibo data with Python
8.44k stars 1.98k forks

issues_bug_574 Long Weibo posts cannot be matched and fetched; attempted fix #575

Closed myshero closed 6 months ago

myshero commented 6 months ago

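Fixed: long Weibo posts that need to be "expanded" (展开) were not fetched correctly.

Improved: line breaks inside a long Weibo post are now preserved. Each <br> tag in the post text is replaced with \n.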

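import random
import re
from time import sleep

from lxml import etree
from lxml.html import fromstring

# handle_html, handle_garbled and logger come from the weiboSpider project itself.


def get_long_weibo(self):
    """Fetch the full text of a long original Weibo post."""
    try:
        # Retry up to 5 times, pausing between attempts.
        for _ in range(5):
            self.selector = handle_html(self.cookie, self.url)
            if self.selector is not None:
                info_div = self.selector.xpath("//div[@class='c' and @id='M_']")[0]
                # Leading "." keeps the search inside info_div instead of the whole page.
                info_span = info_div.xpath(".//span[@class='ctt']")[0]
                # 1. Serialize info_span, including its inner HTML, to a string
                html_string = etree.tostring(info_span, encoding='unicode', method='html')
                # 2. Replace <br> with \n
                html_string = html_string.replace('<br>', '\n')
                # 3. Strip all HTML tags but keep the text inside them
                new_content = fromstring(html_string).text_content()
                # 4. Collapse runs of consecutive \n into a single \n
                new_content = re.sub(r'\n+', '\n', new_content)
                weibo_content = handle_garbled(new_content)
                if weibo_content is not None:
                    return weibo_content
            sleep(random.randint(6, 10))
    except Exception:
        # The log message means "network error".
        logger.exception(u'网络出错')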

Example result (the \n in "content" is the preserved line break; "无" means "none"):

        {
            "id": "Obuk4oIaU",
            "user_id": "",
            "content": ":2024年04月26日,星期五\n今天证实了我们所说,情绪仍处于上升期,即使也要多看多了解。",
            "article_url": "",
            "original_pictures": "无",
            "retweet_pictures": null,
            "original": true,
            "video_url": "无",
            "publish_place": "无",
            "publish_time": "2024-04-26 12:11",
            "publish_tool": "微博网页版",
            "up_num": 3,
            "retweet_num": 0,
            "comment_num": 0
        }
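
The four numbered steps in get_long_weibo above can be tried without a Weibo cookie. The sketch below is a stand-in, not the code merged in this PR: it uses Python's standard-library html.parser instead of lxml so that it runs with no extra dependencies. The names `_TextWithBreaks` and `br_to_newlines` and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser


class _TextWithBreaks(HTMLParser):
    """Hypothetical helper: keeps text nodes and turns each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes the self-closing form <br/> through this method.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside any tag (e.g. <a>) is kept; the tags themselves are dropped.
        self.parts.append(data)


def br_to_newlines(html_fragment):
    """Replace <br> with newlines, strip all other tags, collapse repeated newlines."""
    parser = _TextWithBreaks()
    parser.feed(html_fragment)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    return re.sub(r"\n+", "\n", text)


# Made-up fragment shaped like the span the PR extracts.
sample = '<span class="ctt">:first line<br><br/>second <a href="#">link</a> text</span>'
print(repr(br_to_newlines(sample)))  # ':first line\nsecond link text'
```

The two consecutive breaks in the sample collapse into one \n, which matches the single \n seen in the "content" field of the example result.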
dataabc commented 6 months ago

Thanks for contributing the code. This is a very nice improvement that makes long Weibo posts tidier. Merged.