Open intarga opened 3 weeks ago
The latin2shaw script is part of a broader suite of scripts I use locally. They are mostly hacked together for my own use and as part of learning Python. I have a separate script for cleaning up HTML before passing it to latin2shaw. My scripts aren't really worth uploading, but I've included the code below for how I clean HTML files. The script is called from a Flask application, using the format `http://127.0.0.1:5000/url2shaw?url=<insert URL here>`.
```python
import re
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup
from unidecode import unidecode


def clean_html(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(e)
        return f"{e}"

    response.encoding = 'utf-8'
    text = unidecode(response.text)

    parsed_url = urlparse(url)
    # Drop the last path segment so relative links resolve against the
    # page's directory rather than the page itself
    stripped_path_elements = re.split("(/)", parsed_url.path)[0:-1]
    stripped_path = ''.join(str(i) for i in stripped_path_elements)

    soup = BeautifulSoup(text, features="html.parser")
    text = soup.prettify()

    # Inject a <base> tag and our stylesheet into the head
    text = text.replace(
        '<head>',
        f'<head>\n<base href="{parsed_url.scheme}://{parsed_url.netloc}">\n'
        '<link href="static/url2shaw.css" rel="stylesheet" />'
    )

    # Rewrite relative src attributes to absolute URLs
    def replace_src(match):
        reftype = match.group(1)
        initial_char = match.group(2)
        return f'{reftype}="{parsed_url.scheme}://{parsed_url.netloc}/{stripped_path}{initial_char}'

    pattern = r'(src)="([^/])(?!ttp)'
    text = re.sub(pattern, replace_src, text)

    # Route absolute links back through the url2shaw endpoint
    text = re.sub(
        r'<a([^\>]+)http([^\>]+)\>',
        r'<a \1http://127.0.0.1:5000/url2shaw?url=http\2\>',
        text,
    )
    # Route root-relative links through the endpoint as well
    text = re.sub(
        r'<a([^\>]+)href="/([^\>]+)\>',
        r'<a\1href="http://127.0.0.1:5000/url2shaw?url='
        + parsed_url.scheme + r'://' + parsed_url.netloc + r'/\2\>',
        text,
    )
    return text
```
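For context, here is a minimal sketch of what the Flask side might look like. This is an assumed shape, not the actual application: the `url2shaw` route name matches the URL format above, `clean_html` is stubbed so the sketch is self-contained, and the real app presumably runs latin2shaw on the cleaned HTML before returning it.

```python
# Hypothetical minimal Flask wrapper around clean_html (route name taken
# from the URL format above; everything else is an assumption).
from flask import Flask, request

app = Flask(__name__)


def clean_html(url):
    # Stand-in for the clean_html function above, just to keep this runnable.
    return f"<p>cleaned: {url}</p>"


@app.route("/url2shaw")
def url2shaw():
    url = request.args.get("url", "")
    if not url:
        return "missing ?url= parameter", 400
    # The real application would run latin2shaw on the cleaned HTML here.
    return clean_html(url)
```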
I see, so you do use unidecode on both.
Since you don't mention any other reasons for using unidecode, and I haven't encountered any apart from the one I mentioned, I'm going to assume that's the only thing unidecode is needed for. In that case, I think this simple regex on the Latin text solves the problem better:

```python
text_part = re.sub(r"\b’\b", "'", text_part)
```

This preserves non-ASCII characters and removes the need for unidecode, smartypants, and the BeautifulSoup parse, saving us a bunch of code and two dependencies (not three, because I think we should keep Beautiful Soup for another reason: simplifying the HTML processing).
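A quick sanity check of that regex (a hypothetical snippet, just to illustrate the behaviour): `\b’\b` only matches U+2019 when it sits between word characters, so contractions get fixed while closing quotation marks and non-ASCII letters are left alone.

```python
import re


def fix_apostrophes(text_part):
    # Replace U+2019 only where it sits between word characters,
    # i.e. where it's acting as an apostrophe in a contraction.
    return re.sub(r"\b’\b", "'", text_part)


print(fix_apostrophes("don’t"))          # contraction: becomes don't
print(fix_apostrophes("Ingebjørg’s"))    # non-ASCII letters preserved: Ingebjørg's
print(fix_apostrophes("she said ‘hi’"))  # closing quote untouched
```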
Once the packaging PR gets merged, I'll post a PR for this with relevant test cases.
I was trying to track down a discrepancy between latin2shaw's handling of non-HTML and HTML content: "don’t" was being transliterated incorrectly as "𐑛𐑵n’t" in an HTML document, but correctly as "𐑛𐑴𐑯𐑑" on its own. I realised this is because spaCy doesn't recognise words containing U+2019 ("right single quotation mark") as contractions, but does handle the ASCII apostrophe U+0027 that unidecode converts it to. +1 to using unidecode!
To get latin2shaw to handle these correctly in an HTML document, I tried feeding the text on the HTML branch through unidecode too, but ended up with some unintended consequences... Let's take another example: "My name is Ingebjørg 😇". Without unidecode, this comes out nicely as "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjørg✢ 😇", but with unidecode it becomes "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjorg✢". Oh no... -1 to using unidecode.
I would like to come up with something that solves the problems unidecode does, without introducing its downsides, but I suspect I may be lacking the information to do this well, so I am here asking for more info. Hence:
1) What was the original reason for introducing unidecode? Are there issues it solves other than the right-single-quotation-mark/apostrophe issue I stumbled on?
2) Is there a reason it isn't used in the HTML branch of latin2shaw? It seems to me that any issue it addresses ought to be addressed in both branches.