aarppe closed this issue 1 year ago
Looking at the search ranking results right now, I'm seeing this issue as follows.
We have three basic types of information that can be used for ranking the results:
[edit: modified weights] Following up on the above, after the glossary_count and morpheme_dictionary_frequency rankings have been standardized to the range [0, 1], I'd use the following weightings for producing the aggregate relevance ranking (since lemma_corpus_frequency seemed to have a negligible impact, I'd leave that out, keeping only glossary_count and morpheme_dictionary_frequency):
a. analysis_feature_match: 3
b. glossary_count: 2
c. morpheme_dictionary_frequency: 1
d. cosine_vector_distance: -2 * (1-x)
e. target_language_keyword_match: 1 (as integers)
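As a rough illustration of how these weightings might combine into one score, here is a minimal Python sketch. It assumes that x in the d term is the cosine_vector_distance value itself (the list does not define x), that the features arrive in a plain dict keyed by the names above, and that min-max scaling is the standardization to [0, 1]; the function names `min_max_scale` and `relevance_score` are hypothetical and not taken from the codebase.

```python
# Weights copied from the list above (a, b, c, e); d is handled separately.
WEIGHTS = {
    "analysis_feature_match": 3.0,
    "glossary_count": 2.0,
    "morpheme_dictionary_frequency": 1.0,
    "target_language_keyword_match": 1.0,
}
CVD_WEIGHT = -2.0  # d contributes CVD_WEIGHT * (1 - x)


def min_max_scale(values):
    """Rescale numbers to [0, 1]; a constant list maps to all zeros.

    Assumption: min-max scaling is one plausible way to standardize
    glossary_count and morpheme_dictionary_frequency.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def relevance_score(features):
    """Weighted sum of the five features for one entry.

    `features` is a hypothetical dict layout; b and c are expected to be
    already standardized to [0, 1].
    """
    score = sum(weight * features[name] for name, weight in WEIGHTS.items())
    x = features["cosine_vector_distance"]  # assumed meaning of x
    return score + CVD_WEIGHT * (1.0 - x)
```

Results would then be sorted by this score in descending order. If x was meant as something other than the raw distance, only the last line of `relevance_score` would need to change.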
When using pos_match, I cannot see why we'd retain the logical is_espt_result.
Later on, we'd want to fit these weights empirically, of course.
As a follow-up on how the above rankings a-e are calculated, it would seem that b and c have been precalculated for every entry head, as they do not vary depending on the search terms. That would leave a, d, and e to be calculated at search time, as they depend on comparing the search terms with the dictionary definition content. For d (CVD), there is an existing function in a pre-existing package that calculates those values quickly. Likewise, a and e should be relatively simple calculations of string intersects, with a making use of an ordered approach, giving decreasing weights for later-occurring matches.
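To make the search-time calculations more concrete, the following Python sketch shows one possible shape for an ordered string-intersect score and for a cosine distance. Both are stand-ins under stated assumptions: the 1/(position+1) decay is just one way to give decreasing weights to later matches, the pure-Python `cosine_distance` is not the existing package function mentioned above, and both function names are hypothetical.

```python
import math


def ordered_match_score(query_terms, target_terms):
    """Score the overlap between query terms and an ordered target list.

    Each target term that also occurs in the query adds 1/(position+1),
    so earlier matches count more. The decay is an assumed choice.
    """
    query = set(query_terms)
    return sum(
        1.0 / (i + 1)
        for i, term in enumerate(target_terms)
        if term in query
    )


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

For e, where order may not matter, the same intersect could be reduced to a plain count of shared terms, which would also give the integer values mentioned in the weightings list.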
For now, this is as complete as it can be. We still haven't updated the search ranking calculation to apply to all dictionary entries, and we might want to estimate the search ranking weightings empirically from the new survey results, but both can be left for new, later issues.
The calculation of the above frequencies is described in #1040 and #1041.
We've now got a number of updated resources for integration with the overall regression formula for crk relevance ranking (calculations laid out in #1040 and #1041):
crk/generated/ahenakew_wolfart_bloomfield.fst+cg.freq-sorted.txt
crk/generated/CW_aggregate_morpheme_log_freqs.tsv
crk/generated/crk_glossaries_aggregate_vocab.tsv
These could be used to update the source for the regression formula, as well as potentially revise the formula by adding new explanatory variables (in particular lemma morpheme count).