centre-for-humanities-computing / odyCy

A general-purpose NLP pipeline for Ancient Greek
https://centre-for-humanities-computing.github.io/odyCy/
MIT License
18 stars 2 forks

pandas: hidden dependency for frequency_lemmatizer #21

Closed jankounchained closed 1 year ago

jankounchained commented 1 year ago

gonna be a problem if people download the model from the HF hub & don't have pandas installed

jankounchained commented 1 year ago

table : List[TableEntry] (custom class based on TypedDict)

Usage:

- line 141: pd.DataFrame.from_records(table)
- line 179: match = self.table[self.table.form == _token["form"]]
- line 218: table.to_json()

That's it. We just need a newline-delimited JSON reader & writer, and to specify keys as strings in the lookup.
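A minimal sketch of what that reader & writer could look like with just the stdlib (the helper names write_ndjson/read_ndjson are hypothetical, not anything in the repo), replacing pd.DataFrame.from_records and table.to_json:

```python
import json
from typing import Dict, List


def write_ndjson(path: str, records: List[Dict]) -> None:
    """Write records as newline-delimited JSON, one object per line."""
    with open(path, "w", encoding="utf-8") as out_file:
        for record in records:
            # ensure_ascii=False keeps Greek characters readable in the file
            out_file.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_ndjson(path: str) -> List[Dict]:
    """Read newline-delimited JSON back into a list of dicts."""
    with open(path, encoding="utf-8") as in_file:
        return [json.loads(line) for line in in_file if line.strip()]
```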

x-tabdeveloping commented 1 year ago

I was thinking we could actually use a different format for the table if we are not using pandas. It's quite apparent that linear list lookup is going to be insanely slow for such a large vocabulary, so we should make a hashtable with the verbatim form as the key, and the values being lists of possible lemmas with part-of-speech tags and morphological features, ordered by frequency. We can leave the entries as a list because I highly doubt there will be more than 5 entries per form, so within a single form, linear lookup could even be faster than another hashtable. So, something like this:

{
    ...
    "some_form": [
        {"lemma": "some_lemma", "frequency": 5000, "pos": "ADJ", "feature1": "value1", ...},
        {"lemma": "some_lemma", "frequency": 4000, "pos": "NOUN", "feature1": "value1", ...},
        ....
    ],
    ...
}

aka:

class Entry(TypedDict):
    lemma: str
    upos: str
    frequency: int
    # + all morphological features

LookupTable = Dict[str, List[Entry]]
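Converting the current flat record list into that shape could look something like this (a sketch; records_to_table is a hypothetical helper, and only lemma/upos/frequency keys are shown):

```python
from collections import defaultdict
from typing import Dict, List


def records_to_table(records: List[Dict]) -> Dict[str, List[Dict]]:
    """Group flat records by surface form, most frequent entry first."""
    table: Dict[str, List[Dict]] = defaultdict(list)
    for record in records:
        # drop the form from the entry itself: it becomes the hashtable key
        entry = {key: value for key, value in record.items() if key != "form"}
        table[record["form"]].append(entry)
    for entries in table.values():
        entries.sort(key=lambda entry: entry["frequency"], reverse=True)
    return dict(table)
```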

Then our code for matching could be something along the following lines:

def max_freq_lemma(entries: List[Entry]) -> str:
    """Returns lemma with highest frequency from the given entries."""
    max_index = 0
    n_entries = len(entries)
    for index in range(1, n_entries):
        if entries[index]["frequency"] > entries[max_index]["frequency"]:
            max_index = index
    return entries[max_index]["lemma"]

def match_lemma(_token: Dict[str, str], table: LookupTable) -> Optional[str]:
    """Returns a lemma for a token if it can be found in the frequency table."""
    # Tries to find the entries associated with the token in the table
    match = table.get(_token["form"], [])
    if not match:
        return None
    # We go through all the properties to be matched
    for match_property in MATCH_ORDER:
        match_new = [
            entry
            for entry in match
            if entry.get(match_property, "") == _token.get(match_property, "")
        ]
        if not match_new:
            return max_freq_lemma(entries=match)
        match = match_new
    return max_freq_lemma(entries=match)

# then in the class
class FrequencyLemmatizer:
    ...
    def lemmatize(self, token: Token) -> str:
        # code about backoff goes here
        ...
        orth = token.orth_.lower()
        _token = {
            "form": orth,
            "upos": token.pos_,
            **token.morph.to_dict(),
        }
        lemma = match_lemma(_token, table=self.table)
        if lemma is None:
            return backoff
        else:
            return lemma
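To make the intended lookup behavior concrete, here is a self-contained toy version (the table contents, lookup helper, and MATCH_ORDER value are made up for illustration): filter entries by the matched properties, fall back to all entries for the form when the filter empties out, and take the highest-frequency lemma.

```python
from typing import Dict, List, Optional

# Hypothetical match order and toy table, for illustration only.
MATCH_ORDER = ["upos"]

toy_table: Dict[str, List[Dict]] = {
    "λόγον": [
        {"lemma": "λόγος", "frequency": 5000, "upos": "NOUN"},
    ],
}


def lookup(form: str, upos: str) -> Optional[str]:
    """Simplified lookup: filter entries by upos, return most frequent lemma."""
    entries = toy_table.get(form, [])
    if not entries:
        return None
    # If no entry matches the upos, fall back to all entries for the form
    filtered = [e for e in entries if e.get("upos") == upos] or entries
    best = max(filtered, key=lambda e: e["frequency"])
    return best["lemma"]
```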
x-tabdeveloping commented 1 year ago

Changes have been made, and I created a pull request, please close the issue after merging.

jankounchained commented 1 year ago

this is super done as of PR #22