Open intarga opened 3 weeks ago
The latin2shaw script is part of a broader suite of scripts I use locally. They are mostly hacked together for my own use and as part of learning Python. I have a separate script for cleaning up HTML before passing it to latin2shaw. My scripts aren't really worth uploading, but I've included the code below for how I clean HTML files. The script is called from a Flask application, using the format `http://127.0.0.1:5000/url2shaw?url=<insert URL here>`.
```python
import re
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup
from unidecode import unidecode


def clean_html(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(e)
        return f"{e}"

    response.encoding = 'utf-8'
    text = unidecode(response.text)

    parsed_url = urlparse(url)
    # Drop the last path segment so relative links resolve against the
    # page's directory rather than the page itself
    stripped_path_elements = re.split("(/)", parsed_url.path)[0:-1]
    stripped_path = ''.join(str(i) for i in stripped_path_elements)

    soup = BeautifulSoup(text, features="html.parser")
    text = soup.prettify()

    # Inject a <base> tag and our stylesheet into the head
    text = text.replace(
        '<head>',
        f'<head>\n<base href="{parsed_url.scheme}://{parsed_url.netloc}">\n'
        '<link href="static/url2shaw.css" rel="stylesheet" />'
    )

    # Rewrite relative src attributes to absolute URLs
    def replace_src(match):
        reftype = match.group(1)
        initial_char = match.group(2)
        return f'{reftype}="{parsed_url.scheme}://{parsed_url.netloc}/{stripped_path}{initial_char}'

    pattern = r'(src)="([^/])(?!ttp)'
    text = re.sub(pattern, replace_src, text)

    # Route absolute links back through the url2shaw endpoint
    text = re.sub(
        r'<a([^\>]+)http([^\>]+)\>',
        r'<a \1http://127.0.0.1:5000/url2shaw?url=http\2\>',
        text,
    )
    # Route root-relative links through the endpoint as well
    text = re.sub(
        r'<a([^\>]+)href="/([^\>]+)\>',
        r'<a\1href="http://127.0.0.1:5000/url2shaw?url='
        + parsed_url.scheme + r'://' + parsed_url.netloc + r'/\2\>',
        text,
    )
    return text
```
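For context, here is a minimal sketch of what the Flask side might look like. This is an assumed shape, not the actual application: the `url2shaw` route name matches the URL format above, `clean_html` is stubbed so the sketch is self-contained, and the real app presumably runs latin2shaw on the cleaned HTML before returning it.

```python
# Hypothetical minimal Flask wrapper around clean_html (route name taken
# from the URL format above; everything else is an assumption).
from flask import Flask, request

app = Flask(__name__)


def clean_html(url):
    # Stand-in for the clean_html function above, just to keep this runnable.
    return f"<p>cleaned: {url}</p>"


@app.route("/url2shaw")
def url2shaw():
    url = request.args.get("url", "")
    if not url:
        return "missing ?url= parameter", 400
    # The real application would run latin2shaw on the cleaned HTML here.
    return clean_html(url)
```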
I see, so you do use unidecode on both.
Since you don't mention any other reasons for using unidecode, and I haven't encountered any apart from the one I mentioned, I'm going to assume that's the only thing unidecode is needed for. In that case, I think this simple regex on the Latin text solves the problem better:

```python
text_part = re.sub(r"\b’\b", "'", text_part)
```

This preserves non-ASCII characters and removes the need for unidecode, smartypants, and the BeautifulSoup parse, saving us a bunch of code and two dependencies (not three, because I think we should keep Beautiful Soup for another reason: simplifying the HTML processing).
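A quick sanity check of that regex (a hypothetical snippet, just to illustrate the behaviour): `\b’\b` only matches U+2019 when it sits between word characters, so contractions get fixed while closing quotation marks and non-ASCII letters are left alone.

```python
import re


def fix_apostrophes(text_part):
    # Replace U+2019 only where it sits between word characters,
    # i.e. where it's acting as an apostrophe in a contraction.
    return re.sub(r"\b’\b", "'", text_part)


print(fix_apostrophes("don’t"))          # contraction: becomes don't
print(fix_apostrophes("Ingebjørg’s"))    # non-ASCII letters preserved: Ingebjørg's
print(fix_apostrophes("she said ‘hi’"))  # closing quote untouched
```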
Once the packaging PR gets merged, I'll post a PR for this with relevant test cases.
I was trying to track down a discrepancy between latin2shaw's handling of non-HTML and HTML content: "don’t" was being transliterated incorrectly as "𐑛𐑵n’t" in an HTML document, but correctly as "𐑛𐑴𐑯𐑑" on its own. I realised this is because spaCy doesn't recognise words containing U+2019 ("right single quotation mark") as contractions, but does handle the ASCII apostrophe U+0027 that unidecode converts it to. +1 to using unidecode!
To get latin2shaw to handle these correctly in an HTML document, I tried feeding the text on the HTML branch through unidecode too, but ended up with some unintended consequences... Let's take another example: "My name is Ingebjørg 😇". Without unidecode, this comes out nicely as "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjørg✢ 😇", but with unidecode it becomes "𐑥𐑲 𐑯𐑱𐑥 𐑦𐑟 Ingebjorg✢". Oh no... -1 to using unidecode.
I would like to come up with something that solves the problems unidecode does, without introducing its downsides, but I suspect I may be lacking the information to do this well, so I am here asking for more info. Hence:
1) What was the original reason for introducing unidecode? Are there issues it solves other than the right-single-quotation-mark/apostrophe issue I stumbled on?
2) Is there a reason it isn't used in the HTML branch of latin2shaw? It seems to me that any issue it addresses ought to be addressed in both branches.