Closed myshero closed 6 months ago
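Fixed: long-form Weibo posts that have to be "expanded" could not be fetched correctly.
Improved: line breaks in long-form Weibo posts are now preserved. `<br>` tags in the post text are replaced with `\n`.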
```python
# handle_html, handle_garbled and logger are helpers from the surrounding project.
import random
import re
from time import sleep

from lxml import etree
from lxml.html import fromstring


def get_long_weibo(self):
    """Fetch the full text of a long original Weibo post."""
    try:
        for i in range(5):
            self.selector = handle_html(self.cookie, self.url)
            if self.selector is not None:
                info_div = self.selector.xpath("//div[@class='c' and @id='M_']")[0]
                info_span = info_div.xpath("//span[@class='ctt']")[0]
                # 1. Serialize everything inside info_span to an HTML string
                html_string = etree.tostring(info_span, encoding='unicode', method='html')
                # 2. Replace <br> with \n
                html_string = html_string.replace('<br>', '\n')
                # 3. Strip all HTML tags but keep the text inside them
                new_content = fromstring(html_string).text_content()
                # 4. Collapse runs of consecutive \n into a single \n
                new_content = re.sub(r'\n+', '\n', new_content)
                weibo_content = handle_garbled(new_content)
                if weibo_content is not None:
                    return weibo_content
            sleep(random.randint(6, 10))
    except Exception:
        logger.exception(u'网络出错')  # "network error"
```
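To try the line-break handling on its own, without the crawler or a cookie, the four numbered steps above can be sketched with the standard library. This is only an illustration, not the code from this PR: it swaps lxml for `html.parser`, and the names `_TextWithBreaks`, `html_to_text_keep_breaks` and `sample` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextWithBreaks(HTMLParser):
    """Hypothetical helper: collect text, turning each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both arrive here; every other tag is dropped.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside any tag is kept as-is.
        self.parts.append(data)


def html_to_text_keep_breaks(html_fragment):
    """Strip tags from an HTML fragment while preserving line breaks."""
    parser = _TextWithBreaks()
    parser.feed(html_fragment)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of consecutive newlines into a single newline.
    return re.sub(r"\n+", "\n", text)


# Assumed sample markup, loosely modelled on a span with class "ctt".
sample = '<span class="ctt">Line one<br><br/>Line two <a href="#">link</a></span>'
print(repr(html_to_text_keep_breaks(sample)))
```

Because the `<br>` tags become `\n` before the remaining tags are dropped, the paragraph structure of the post survives in the plain-text `content` field.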
Example result:
```json
{
    "id": "Obuk4oIaU",
    "user_id": "",
    "content": ":2024年04月26日,星期五\n今天证实了我们所说,情绪仍处于上升期,即使也要多看多了解。",
    "article_url": "",
    "original_pictures": "无",
    "retweet_pictures": null,
    "original": true,
    "video_url": "无",
    "publish_place": "无",
    "publish_time": "2024-04-26 12:11",
    "publish_tool": "微博网页版",
    "up_num": 3,
    "retweet_num": 0,
    "comment_num": 0
}
```
Thanks for contributing code. This is a very nice improvement that makes long Weibo posts cleaner. Merged.