Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0
6.52k stars 623 forks

Fixing the parsing of scientific notation and non-ASCII characters #744

Open loesinghaus opened 11 hours ago

loesinghaus commented 11 hours ago

Because PDF encodings are a mess, parsing often leads to poor results with non-ASCII text of all kinds. For example, scientific notation in the form 10^3 is typically parsed as "103" by PyMuPDF and as "10 3" by grobid, which is really unfortunate when one tries to extract quantitative information (though the latter is fixable by post-processing). Similarly, non-ASCII characters are a bit of an RNG. For example, in one paper I was looking at IFN-β was parsed as IFN-\x02 by PyMuPDF and IFN-␤ by grobid.
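As a quick illustration (a hypothetical snippet, not actual parser output from the repo), mis-decoded control characters like that `\x02` are at least easy to flag programmatically:

```python
# flag suspicious control characters like the "\x02" that PyMuPDF
# produced for the beta glyph (illustrative string, not real parser output)
text = "IFN-\x02 signaling was reduced"
suspicious = [(i, ch, hex(ord(ch)))
              for i, ch in enumerate(text)
              if ord(ch) < 32 and ch not in "\n\r\t"]
print(suspicious)  # -> [(4, '\x02', '0x2')]
```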

A few questions on this: 1) I think grobid results are generally more fixable than PyMuPDF results. The newest paper for this repo uses grobid for some of its results - are you planning to add your own implementation of grobid parsing to this repo? 2) As far as I can tell, there's currently no post-processing for the PyMuPDF parsing, and the recent paper also doesn't mention any post-processing for the grobid results. Do you currently plan to implement post-processing to fix some of this?

Some of these things would be relatively easy to fix. For example, when using grobid, scientific number notation could be rescued with something like this:

import re

def rescue_scientific_notation(text):
    # replace the space grobid inserts between 10 and the exponent with ^
    # require at least one digit so a bare "10 " is left untouched
    text = re.sub(r'([*x]?)(10) (\d+)', r'\1\2^\3', text)
    return text
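To make the behavior concrete, here is the substitution on a grobid-style string (standalone sketch; note that requiring at least one exponent digit via `\d+` still leaves false positives like "1910 3" unhandled):

```python
import re

def rescue_scientific_notation(text):
    # re-join exponents that grobid split off, e.g. "10 3" -> "10^3"
    return re.sub(r'([*x]?)(10) (\d+)', r'\1\2^\3', text)

print(rescue_scientific_notation("a titer of 2 x 10 6 PFU per 10 2 cells"))
# -> a titer of 2 x 10^6 PFU per 10^2 cells
```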

Similarly, weird non-ASCII encodings could probably be rescued by asking ChatGPT to guess the correct character (this is a really crude example implementation):

import asyncio

from openai import AsyncOpenAI
from pydantic import BaseModel

def get_client():
    api_key = "an-api-key"
    # use the async client so the requests below can actually run concurrently
    client = AsyncOpenAI(api_key=api_key)
    return client

def extract_gpt_response(response):
    # pull the parsed pydantic object out of the structured response
    response = response.choices[0].message.parsed
    return response

async def fetch_structured_response(query, model, response_format, system_prompt, client):
    response = await client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            query,
        ],
        response_format=response_format,
    )
    return response

async def get_structured_responses(queries, model, response_format, system_prompt, client):
    # run all requests concurrently
    responses = await asyncio.gather(
        *(fetch_structured_response(query, model, response_format, system_prompt, client)
          for query in queries)
    )
    responses = [extract_gpt_response(r) for r in responses]

    return responses

def return_queries(contents):
    return [{"role": "user", "content": content} for content in contents]

class TranslationDictionary(BaseModel):
    original_char: list[str]
    new_char: list[str]

system_prompt_conversion = """
You are looking at a text from a publication that contains potentially falsely encoded non-ASCII characters.
You are given examples in the format character, ord(char), hex(ord(char)), and the 20 characters before and after the character.
Your task is to create a python dictionary that maps each non-ASCII character to its correct ASCII equivalent.
Note that you can also match to the original character. Only do so if it genuinely makes sense in context.
If you cannot find a match, replace the character with U+2205.
"""

def replace_text(text, replacements):
    # apply simple string replacements in order
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

async def preclean_parsed_text(text):
    ascii_chars = [chr(i) for i in range(128)]
    greek_chars = [chr(i) for i in range(945, 970)]
    other_known_chars = ["©", "±", "°", "′", "Δ", "ï", "ö", "ü", "ä", "\n", "\r", "\t"]
    known_chars = set(ascii_chars + greek_chars + other_known_chars)

    # fix some known replacements
    known_replacements = {"×": "x", "ï": "i", "  ": " ", "u ¨": "ü", "a ¨": "ä", "o ¨": "ö"}
    text = replace_text(text, known_replacements)

    # rescue scientific notation (only works for grobid parsing)
    text = rescue_scientific_notation(text)

    # ------------------------------------------------------------
    # try to clean up unusual characters, keeping 20 chars of context on each side
    unusual_chars = []
    for char_index, char in enumerate(text):
        if char not in known_chars:
            context = text[max(0, char_index - 20):char_index + 20]
            unusual_chars.append((char, ord(char), hex(ord(char)), context))
    # order by the ord() value
    unusual_chars.sort(key=lambda x: x[1])

    # keep at most 10 examples per char
    unusual_chars_filtered = []
    curr_char = ""
    i = 0
    for char, e1, e2, e3 in unusual_chars:
        if char != curr_char:
            curr_char = char
            i = 0
        if i < 10:
            unusual_chars_filtered.append((char, e1, e2, e3))
            i += 1

    queries = return_queries([str(unusual_chars_filtered)])
    responses = await get_structured_responses(
        queries, model="gpt-4o",
        response_format=TranslationDictionary,
        system_prompt=system_prompt_conversion, client=get_client())
    response = responses[0]

    # create the replacement dictionary from the structured response
    replacement_dict = dict(zip(response.original_char, response.new_char))

    # replace the text
    text = replace_text(text, replacement_dict)

    return text
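To make the query format concrete, this is what the (char, ord, hex, context) examples sent to the model look like (standalone sketch with a made-up input string; `collect_unusual_chars` is just the gathering loop factored out):

```python
def collect_unusual_chars(text, known_chars, context=20):
    # gather (char, codepoint, hex, surrounding context) for unknown characters
    return [(ch, ord(ch), hex(ord(ch)), text[max(0, i - context):i + context])
            for i, ch in enumerate(text)
            if ch not in known_chars]

known = {chr(i) for i in range(128)}
examples = collect_unusual_chars("IFN-␤ induces IFN-␥", known)
print([(c, o, h) for c, o, h, _ in examples])
# -> [('␤', 9252, '0x2424'), ('␥', 9253, '0x2425')]
```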

This correctly fixes most of the worst offenders and the interferon names:

replacement_dict = {
    'ϩ': '+', 'Ϫ': '-', 'ϫ': 'x', 'Ϯ': '+', 'Ͻ': '<', 'Ͼ': '>', 'Ј': "'",
    '؉': '+', '؊': '-', 'ء': 'U+2205', '\n': '\n', '†': '\n', '‡': '\n',
    '␣': 'α', '␤': 'β', '␥': 'γ',
}
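Applied with a simple replacement helper (`replace_text` here is an assumed implementation, using sequential `str.replace`), such a dictionary recovers the interferon names:

```python
def replace_text(text, replacements):
    # sequential str.replace; adequate for single-character keys like these
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

replacement_dict = {'Ͻ': '<', '␤': 'β', '␥': 'γ'}
print(replace_text("IFN-␤ and IFN-␥ (p Ͻ 0.05)", replacement_dict))
# -> IFN-β and IFN-γ (p < 0.05)
```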

Are there plans to implement parsing post-processing?

dosubot[bot] commented 11 hours ago

@loesinghaus Hello! I'm here to help you with any issues or questions you have. Let's work together to solve the problem.

Currently, there is no specific mention of post-processing for handling scientific notation or non-ASCII characters in the paper-qa codebase. The existing implementation uses libraries like PyMuPDF and grobid for parsing, but there is no documented post-processing logic for correcting issues like those you've described [1].

Your suggestion to implement post-processing, such as using regular expressions to correct scientific notation or leveraging AI models to handle non-ASCII characters, could be a valuable addition. If you are considering contributing these improvements, integrating them into the existing parsing functions or creating a dedicated post-processing module would be a logical approach.

To continue talking to Dosu, mention @dosu.