Closed: jankounchained closed this issue 1 year ago
table: List[TableEntry] (TableEntry is a custom class based on TypedDict)
Usage:
141: pd.DataFrame.from_records(table)
179: match = self.table[self.table.form == _token["form"]]
218: table.to_json()
That's it. Just need a newline-delimited JSON reader & writer. And specify keys as strings in the lookup.
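A minimal sketch of what that reader & writer could look like (function names are hypothetical, assuming one JSON object per line):

import json
from typing import List

def write_ndjson(records: List[dict], path: str) -> None:
    """Writes one JSON object per line."""
    with open(path, "w", encoding="utf-8") as out_file:
        for record in records:
            out_file.write(json.dumps(record) + "\n")

def read_ndjson(path: str) -> List[dict]:
    """Reads one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as in_file:
        return [json.loads(line) for line in in_file if line.strip()]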
I was thinking we could actually use a different format for the table if we are not using pandas. It's quite apparent that list lookup is going to be insanely slow for such a large vocabulary, so we should make a hashtable with the verbatim form as the key, and the values being lists of possible lemmas with part-of-speech tags and morphological features, ordered by frequency. We can leave the entries as lists because I highly doubt there are going to be more than 5 entries for each form, so it could even be the case that linear lookup over them is faster. So, something like this:
{
    ...
    "some_form": [
        {"lemma": "some_lemma", "frequency": 5000, "upos": "ADJ", "feature1": "value1", ...},
        {"lemma": "some_lemma", "frequency": 4000, "upos": "NOUN", "feature1": "value1", ...},
        ...
    ],
    ...
}
aka:
from typing import Dict, List, Optional, TypedDict

class Entry(TypedDict):
    lemma: str
    upos: str
    frequency: int
    # + all morphological features

LookupTable = Dict[str, List[Entry]]
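Building this structure from the current flat list of records could then be something like the following (a sketch, assuming each record carries a "form" key alongside the Entry fields):

from collections import defaultdict

def build_lookup_table(records: List[dict]) -> LookupTable:
    table = defaultdict(list)
    for record in records:
        # Everything except the form itself becomes the entry
        entry = {key: value for key, value in record.items() if key != "form"}
        table[record["form"]].append(entry)
    # Keep each form's entries ordered by descending frequency
    for entries in table.values():
        entries.sort(key=lambda entry: entry["frequency"], reverse=True)
    return dict(table)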
Then our code for matching could be something along the following lines:
def max_freq_lemma(entries: List[Entry]) -> str:
    """Returns the lemma with the highest frequency among the given entries."""
    max_index = 0
    for index in range(1, len(entries)):
        if entries[index]["frequency"] > entries[max_index]["frequency"]:
            max_index = index
    return entries[max_index]["lemma"]

def match_lemma(_token: Dict[str, str], table: LookupTable) -> Optional[str]:
    """Returns a lemma for a token if its form can be found in the lookup table."""
    # Try to find the entries associated with the token's form
    match = table.get(_token["form"], [])
    if not match:
        return None
    # Go through all the properties to be matched, most important first
    for match_property in MATCH_ORDER:
        match_new = [
            entry
            for entry in match
            if entry.get(match_property, "") == _token.get(match_property, "")
        ]
        # If filtering on this property rules out every candidate,
        # fall back to the most frequent lemma among the remaining matches
        if not match_new:
            return max_freq_lemma(entries=match)
        match = match_new
    return max_freq_lemma(entries=match)
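For illustration, here is how match_lemma would behave on a tiny hypothetical table (MATCH_ORDER and all values below are made up):

MATCH_ORDER = ["upos"]  # hypothetical: only match on part of speech

table = {
    "left": [
        {"lemma": "leave", "frequency": 5000, "upos": "VERB"},
        {"lemma": "left", "frequency": 2000, "upos": "ADJ"},
    ],
}
match_lemma({"form": "left", "upos": "ADJ"}, table)  # -> "left"
match_lemma({"form": "left", "upos": "AUX"}, table)  # -> "leave" (no upos match, falls back to most frequent)
match_lemma({"form": "dog", "upos": "NOUN"}, table)  # -> None (form not in table)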
# then in the class (Token is spaCy's token type)
from spacy.tokens import Token

class FrequencyLemmatizer:
    ...

    def lemmatize(self, token: Token) -> str:
        # code about backoff (assumed to set `backoff`, the fallback lemma)
        ...
        orth = token.orth_.lower()
        _token = {
            "form": orth,
            "upos": token.pos_,
            **token.morph.to_dict(),
        }
        lemma = match_lemma(_token, table=self.table)
        if lemma is None:
            return backoff
        else:
            return lemma
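Serialization without pandas could then reuse the reader & writer sketched above. A rough sketch, continuing the class (the to_disk/from_disk method names are hypothetical):

    def to_disk(self, path: str) -> None:
        # Flatten the lookup table back into one record per entry
        records = [
            {"form": form, **entry}
            for form, entries in self.table.items()
            for entry in entries
        ]
        write_ndjson(records, path)

    def from_disk(self, path: str) -> None:
        self.table = build_lookup_table(read_ndjson(path))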
Changes have been made and I've created a pull request; please close the issue after merging.
this is super done as of PR #22
gonna be a problem if people download the model from HF Hub & don't have pandas installed