RedHotUnicorn / PDB-tools

Tools for managing PDB

Try to switch from trafilatura #17

Closed RedHotUnicorn closed 6 months ago

RedHotUnicorn commented 7 months ago

import urllib.request
from inscriptis import get_text
from inscriptis.model.config import ParserConfig

url = "https://t.me/zettelkasten_ch/549?embed=1&mode=tme"
url = "https://habr.com/ru/companies/ncloudtech/articles/806771/"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html, ParserConfig(display_links=True))
print(text)

RedHotUnicorn commented 6 months ago

Switched to markdownify.

Also used morss readability to fetch the readable content.