RSS subscription: text in <p> tags is not displayed on its own line

ifwlzs commented 7 months ago

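Environment

nonebot-bison version: 0.9.2
nonebot version: 2.2.1
Installation method: 1 (one of the following, or another method)
1. Installed via nb-cli
2. Installed with a modern package manager such as poetry/pdm
3. Installed via pip install
4. Cloned or downloaded the project and used it directly
Operating system: windows 2009 (19045.4710)

Problem

In RSS subscriptions, the text inside <p> tags is not displayed on its own line.

Logs

Please paste your logs here

[ √ ] I have searched the issues and did not find a problem similar to mine
[ √ ] I confirm that I have removed sensitive information from the logs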

suyiiyii commented 5 months ago

The cause is at line 68 here: when the bs library is used to extract the text from the HTML, formatting information such as <p> tags is lost. https://github.com/MountainDash/nonebot-bison/blob/1c753f7a2c38be972c9fa125cf7936b4fb7a3888/nonebot_bison/platform/rss.py#L65-L69

In [23]: doc = """
    ...: terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin
    ...: """

In [24]: soup = bs(doc,"html.parser")

In [25]: soup.get_text()
Out[25]: '\nterterthvcxiobjhoijeraoijgiojoidfgjkldfjgiojbvcxninclin\n'

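How bs decides where to break lines when extracting text

It appears to follow the line breaks in the HTML source itself. From https://www.crummy.com/software/BeautifulSoup/bs4/doc/:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
```

With the same HTML collapsed onto a single line, the line breaks disappear from the output:

```python
In [14]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>"""

In [15]: soup = BeautifulSoup(html_doc, 'html.parser');print(soup.get_text())
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....
```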

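I can think of two possible fixes.

Manually preprocess the HTML

After fetching the description, preprocess it by hand first: for example, replace <p> with <br>, then replace <br> with \n. Then pass the preprocessed HTML to bs to get text that keeps its line structure.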
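The preprocessing step could look roughly like the sketch below. It uses only the standard library `re` module; `preprocess_html` and the two regular expressions are hypothetical names made up for illustration and are not part of nonebot-bison. Collapsing consecutive newlines is an extra assumption on top of the two replacements described above.

```python
import re

# Hypothetical helper, not part of nonebot-bison.
# Matches <p>, </p> and <p class="..."> but not other tags such as <pre>.
_P_TAG = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)
# Matches <br>, <br/> and <br />.
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def preprocess_html(html: str) -> str:
    # Step 1: treat every opening or closing <p> as a line break.
    html = _P_TAG.sub("<br>", html)
    # Step 2: replace each <br> with a newline character.
    html = _BR_TAG.sub("\n", html)
    # Assumption: collapse runs of newlines (e.g. from "</p><p>") into one
    # and drop leading/trailing newlines.
    return re.sub(r"\n{2,}", "\n", html).strip("\n")


doc = "terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin"
print(preprocess_html(doc))
```

The returned string still contains any other tags from the original HTML, so it would then be passed to bs as before; the newline characters are ordinary text at that point and should survive `get_text()`.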

html2text

This library converts HTML to Markdown.

In [27]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
    ...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
    ...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
    ...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
    ...: p>"""

In [28]: h = html2text.HTML2Text()

In [29]: h.ignore_links = True

In [30]: print(h.handle(html_doc))
**The Dormouse's story**

Once upon a time there were three little sisters; and their names
wereElsie,Lacie andTillie;and they lived at the bottom of a well.

...

In [31]: html_doc = """<html><body><p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin</body></html>"""

In [32]: h = html2text.HTML2Text()

In [33]: h.ignore_links = True

In [34]: print(h.handle(html_doc))
cxiobjhoijeraoi

jgiojoidfgjk

ldfjgioj

bvcxninclin

After this processing, the result is reasonably clean plain text.

@AzideCupric

felinae98 commented 5 months ago

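I remember that weibo, or some other platform, also has similar (hand-rolled) code for turning HTML into text. Should we handle them all in one place?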
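As a rough idea of what a single shared helper might look like, here is a sketch built on the standard library `html.parser` module. Everything in it is an assumption for illustration: `html_to_text`, `_TextExtractor` and the `BLOCK_TAGS` set are hypothetical names, and it is not how weibo or any other nonebot-bison platform currently handles HTML.

```python
from html.parser import HTMLParser

# Assumption: which tags force a line break is a choice made for this sketch.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and inserts a newline around block-level tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    # Hypothetical shared entry point that each platform could call.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    # Drop empty lines so adjacent block tags do not produce blank rows.
    return "\n".join(line for line in lines if line)


print(html_to_text("terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin"))
```

Keeping the tag set in one place means each platform would only need to call the one function, and any change to line-break behavior would apply everywhere at once.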

MountainDash / nonebot-bison

RSS subscription: text in <p> tags is not displayed on its own line #528
