freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License

Cleaner to normalize unicode glyphs? #50

Open jcushman opened 3 years ago

jcushman commented 3 years ago

Do you all have any insight about cleaning text for non-ASCII characters? We have two parts of this in play for CAP: normalizing typographic punctuation (curly quotes, dashes, stray accent marks) to ASCII so citations match, and transliterating accented letters like é and ü.

I'm thinking of throwing everything through https://pypi.org/project/Unidecode/, which I think will do both of those things:

>>> from unidecode import unidecode
>>> print(unidecode('‘’´“”–éü'))
'''""-eu

I haven't measured performance yet though; might be overkill. Any other suggestions? And does some form of this want to make it into the built-in eyecite cleaners? That part doesn't matter for CAP's purposes, just curious if it'd be helpful.
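
If it did end up being useful, one low-effort option might be to pass unidecode in as a custom cleaning step. A rough, untested sketch, assuming clean_text accepts arbitrary callables alongside the built-in step names (which is how I read the cleaners module):

from unidecode import unidecode

from eyecite import clean_text, get_citations

# unidecode is just a str -> str callable, so it can ride along
# with the built-in cleaning steps (assuming clean_text accepts
# arbitrary callables; adjust if the API differs).
text = "“Smith v. Jones,” 1 U.S. 1 (1790) – some OCR’d text"
cleaned = clean_text(text, ["all_whitespace", unidecode])
citations = get_citations(cleaned)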

mlissner commented 3 years ago

We have one or two OCR fixes that we automatically apply, so I'm definitely interested in a general solution if you head in that direction. Years ago, I assumed there must be...something. Some word list or cleanup tools or something that some academic or open source person or somebody had created, but I came up with absolutely nothing. I sort of concluded that the reason there was nothing was because the best OCR tools already have this built in, but I only half believe that.

Here's the function I made; "one or two fixes" was right:

def cleanup_ocr_text(txt: str) -> str:
    """Do some basic cleanup to make OCR text better.

    Err on the side of safety. Don't make fixes that could cause other issues.

    :param txt: The txt output from the OCR engine.
    :return: Txt output, cleaned up.
    """
    simple_replacements = (
        ("Fi|ed", "Filed"),  # lowercase "l" misread as a pipe
        (" Il ", " II "),  # "Il" misread where roman numeral II was intended
    )
    for bad, good in simple_replacements:
        txt = txt.replace(bad, good)
    return txt

This does feel out of scope for eyecite though, no?

jcushman commented 3 years ago

One citation-specific angle that might eventually move this upstream from CAP to eyecite: off-the-shelf OCR software seems particularly typo-prone in citation strings relative to the rest of the text, because it doesn't have a language model trained on legal citations to predict what a character is supposed to be. So it's much more likely to get reporter strings wrong than other strings -- examples I've noticed just flipping through cases are R2d -> P.2d, Yt. -> Vt., la. -> Ia., Pae. -> Pac., and 5.Ct. -> S.Ct. I also believe I've seen speckles on the page turn into umlauts, accents, colons, and such within citations more often than elsewhere, though I don't have examples handy.
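
If that pattern holds up, I could imagine seeding a word-boundary replacement table from examples like these. A rough sketch (left side is the OCR misread, right side the intended reporter; none of this is vetted against real data):

import re

# Illustrative mapping built from the examples above; \b keeps
# patterns from firing inside longer tokens (e.g. "la." inside "Fla.").
REPORTER_OCR_FIXES = {
    r"\bR2d\b": "P.2d",
    r"\bYt\.": "Vt.",
    r"\bla\.": "Ia.",
    r"\bPae\.": "Pac.",
    r"\b5\.Ct\.": "S.Ct.",
}

def fix_reporter_ocr(text: str) -> str:
    for pattern, correction in REPORTER_OCR_FIXES.items():
        text = re.sub(pattern, correction, text)
    return text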

Probably best to just let this simmer in CAP and we'll see if our collection of edge cases adds up to anything coherent.

jcushman commented 3 years ago

Separately, I think a punctuation-normalizing filter probably does want to be in eyecite, since the algorithm depends on matching ASCII punctuation like quotes and dashes.
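
Something as small as a translation table might cover it. A sketch, with the glyph list pulled from the unidecode example above (not exhaustive):

# Sketch of a punctuation normalizer that could run as its own
# cleaning step; the glyph set here is illustrative only.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u00b4": "'",  # acute accent
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})

def normalize_punctuation(text: str) -> str:
    return text.translate(PUNCT_TABLE)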

mlissner commented 3 years ago

That's interesting. I bet turning all umlauts into u's would be a net benefit. I guess I could also see some of these common citation OCR misses (R2d, for example) showing up in reporters_db somehow. Seems messy, though.

devlux76 commented 2 years ago

Be careful when normalizing unicode: it gets rid of things that can be very important, such as § (Sec.) and §§ (Secs.). So before doing that, it might be good to parse the text for legal glyphs and convert them to their English equivalents.
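
For example, something along these lines (a sketch; the exact expansions and the ¶ entries are my own assumptions):

# Sketch: spell out legal glyphs before any transliteration pass
# that might drop or mangle them. Order matters: "§§" must be
# replaced before "§". The ¶ entries are an assumed addition.
LEGAL_GLYPHS = [
    ("§§", "Secs."),
    ("§", "Sec."),
    ("¶¶", "Paras."),
    ("¶", "Para."),
]

def expand_legal_glyphs(text: str) -> str:
    for glyph, expansion in LEGAL_GLYPHS:
        text = text.replace(glyph, expansion)
    return text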