UAlbertaALTLab / morphodict

The Language Independent Intelligent Dictionary
https://morphodict.readthedocs.io/
Apache License 2.0

Update various ranking sources and revise overall regression formula for crk search #1042

Closed by aarppe 1 year ago

aarppe commented 2 years ago

We've now got a number of updated resources for integration with the overall regression formula for crk relevance ranking (calculations laid out in #1040 and #1041):

  1. Corpus-based word-form/analysis counts

crk/generated/ahenakew_wolfart_bloomfield.fst+cg.freq-sorted.txt

  2. Dictionary-based morpheme-frequency counts and morpheme counts for lexical entries in CW

crk/generated/CW_aggregate_morpheme_log_freqs.tsv

  3. Core vocabulary occurrences

crk/generated/crk_glossaries_aggregate_vocab.tsv

These could be used to update the source data for the regression formula, as well as potentially to revise the formula by adding new explanatory variables (in particular lemma morpheme count).
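As a rough illustration only (not code from the repository), the sketch below shows one way the per-lemma values in these files could be loaded and standardized to [0, 1] before entering the ranking formula. The TSV layout (head word in the first column, numeric value in the last) and the function names are assumptions.

```python
import csv


def load_value_column(tsv_path, value_column=-1):
    """Read a per-lemma numeric value from a TSV file.

    Assumes one lemma per row with the value in `value_column`;
    the actual column layout of the generated files may differ.
    """
    values = {}
    with open(tsv_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            try:
                values[row[0]] = float(row[value_column])
            except ValueError:
                continue  # skip header or malformed rows
    return values


def rescale_to_unit_interval(values):
    """Min-max standardize a {lemma: value} mapping onto [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {lemma: (v - lo) / span for lemma, v in values.items()}


# Hypothetical usage with the generated resources listed above:
# glossary_count = rescale_to_unit_interval(
#     load_value_column("crk/generated/crk_glossaries_aggregate_vocab.tsv"))
# morpheme_dictionary_frequency = rescale_to_unit_interval(
#     load_value_column("crk/generated/CW_aggregate_morpheme_log_freqs.tsv"))
```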

aarppe commented 2 years ago

Looking at the search ranking results right now, I'm seeing this issue as follows.

We have three basic types of information that can be used for ranking the results:

  1. Frequency/centrality: core vocabulary score > corpus lemma frequency > morpheme dictionary frequency. Core vocabulary items such as mîcisow, mîciw, and mowêw should be prioritized even if the semantic match isn't the closest, since that is what one expects.
  2. Semantic similarity: cosine vector distance
  3. Part-of-speech match: when English phrase analysis provides a result, match primarily on the specific word class and secondarily on the general word class; when it does not, fall back to 1 and 2 for ranking. If the user types in a phrase that can be analyzed e.g. as V+TA, then the topmost results should be TA verbs, not TI or AI ones. One could even consider restricting the results to specific word-class matches, but alternatively other matches could be kept as long as they (mostly) appear after the specific word-class matches (a small ordering sketch follows this list).
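To make the intended ordering for signal 3 concrete, here is a minimal sort-key sketch; it is illustrative only, and the word-class labels and function name are assumptions rather than morphodict's actual data model.

```python
def pos_match_rank(result_word_class, query_specific_wc, query_general_wc):
    """Lower rank sorts first: specific word-class match, then general, then the rest."""
    if query_specific_wc and result_word_class == query_specific_wc:          # e.g. "VTA"
        return 0
    if query_general_wc and result_word_class.startswith(query_general_wc):   # e.g. "V"
        return 1
    return 2


# Hypothetical usage: a query analyzed as V+TA puts TA verbs first,
# then other verbs, then everything else.
word_classes = ["VTI", "VTA", "NA", "VAI", "VTA"]
word_classes.sort(key=lambda wc: pos_match_rank(wc, "VTA", "V"))
print(word_classes)  # ['VTA', 'VTA', 'VTI', 'VAI', 'NA']
```

Ties within each rank would still be ordered by signals 1 and 2 (frequency/centrality and cosine distance) in an actual implementation.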
aarppe commented 2 years ago

[edit: modified weights] Following up on the above, after the glossary_count and morpheme_dictionary_frequency rankings have been standardized to the range [0, 1], I'd use the following weightings for producing the aggregate relevance ranking (since lemma_corpus_frequency seemed to have a negligible impact, I'd leave that out, keeping only glossary_count and morpheme_dictionary_frequency):

a. analysis_feature_match: 3
b. glossary_count: 2
c. morpheme_dictionary_frequency: 1
d. cosine_vector_distance: -2 * (1 - x)
e. target_language_keyword_match: 1

(as integers)
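Reading this literally, and assuming x in (d) stands for the cosine similarity (so that the term amounts to a weight of -2 on the cosine vector distance), the aggregate score is a simple weighted sum. The sketch below spells out that reading with hypothetical field names; it is not the formula as implemented in morphodict.

```python
from dataclasses import dataclass


@dataclass
class RankingFeatures:
    # Field names are hypothetical stand-ins for whatever the result object exposes.
    analysis_feature_match: float         # a: analysis / part-of-speech match
    glossary_count: float                 # b: core-vocabulary score, standardized to [0, 1]
    morpheme_dictionary_frequency: float  # c: CW morpheme frequency, standardized to [0, 1]
    cosine_vector_distance: float         # d: CVD, smaller means semantically closer
    target_language_keyword_match: float  # e: keyword match score


def relevance_score(f: RankingFeatures) -> float:
    """Weighted sum using the proposed weights a: 3, b: 2, c: 1, d: -2, e: 1."""
    return (
        3.0 * f.analysis_feature_match
        + 2.0 * f.glossary_count
        + 1.0 * f.morpheme_dictionary_frequency
        - 2.0 * f.cosine_vector_distance  # "-2 * (1 - x)" with x read as cosine similarity
        + 1.0 * f.target_language_keyword_match
    )


# Results would then be sorted by descending score:
# results.sort(key=relevance_score, reverse=True)
```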

When using pos_match, I cannot see why we'd retain the logical is_espt_result.

Later on, we'd want to fit these weights empirically, of course.
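Purely as an illustration of what fitting the weights empirically could look like (no such pipeline exists in this issue), one could regress relevance judgments for (query, entry) pairs onto the same five features and take the learned coefficients as the weights. scikit-learn and the toy numbers below are assumptions used only to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [analysis_feature_match, glossary_count, morpheme_dictionary_frequency,
#            cosine_vector_distance, target_language_keyword_match]; toy numbers only.
X = np.array([
    [1, 0.9, 0.7, 0.2, 1],
    [0, 0.1, 0.3, 0.8, 0],
    [1, 0.4, 0.5, 0.5, 1],
    [0, 0.8, 0.2, 0.9, 0],
])
# Toy relevance judgments for the corresponding (query, entry) pairs.
y = np.array([2.0, 0.0, 1.0, 0.5])

model = LinearRegression().fit(X, y)
# model.coef_ would then replace the hand-picked weights 3, 2, 1, -2, 1.
print(dict(zip(
    ["analysis_feature_match", "glossary_count", "morpheme_dictionary_frequency",
     "cosine_vector_distance", "target_language_keyword_match"],
    model.coef_.round(3),
)))
```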

aarppe commented 2 years ago

As a follow-up to how the above rankings a-e are calculated, it would seem that b and c have been precalculated for every entry head, as they do not vary with the search terms. That leaves a, d, and e to be calculated at search time, since they depend on comparing the search terms with the dictionary definition content. For d (CVD), there is an existing function in a pre-existing package that calculates those values quickly. Likewise, a and e should be relatively simple string-intersection calculations, with a using an ordered approach that gives decreasing weights to later-occurring matches (sketched below).
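As a purely illustrative reading of that last point (not the project's implementation), an ordered match could weight the first matching query term most heavily and later matches progressively less, e.g. with weights 1, 1/2, 1/3, ...; the harmonic weighting and function name below are assumptions.

```python
def ordered_match_score(query_terms, entry_terms):
    """Score a string intersection, giving decreasing weight to later-occurring matches.

    The query term at position i contributes 1 / (i + 1) if it occurs among the
    entry's terms, so earlier matches dominate the score.
    """
    entry_set = {t.lower() for t in entry_terms}
    score = 0.0
    for position, term in enumerate(query_terms):
        if term.lower() in entry_set:
            score += 1.0 / (position + 1)
    return score


# Hypothetical usage against an entry's keyword list:
# ordered_match_score(["she", "eats"], ["eat", "she", "s/he eats"])  # -> 1.0
```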

aarppe commented 1 year ago

For now, this is as done as it can be. We still haven't updated the search ranking calculation to apply to all dictionary entries, which can be left as a task for later, and we might want to estimate the search ranking weightings empirically based on the new survey results, but both can be left for new, later issues.

aarppe commented 1 year ago

The calculation of the above frequencies is described in #1040 and #1041.