Rct567 / FrequencyMan

An Anki plugin to sort your new cards.
GNU General Public License v3.0

Different lengths of frequency lists #10

Open aleksejrs opened 1 month ago

aleksejrs commented 1 month ago

Imagine you have

IIUC, the shorter frequency lists are kind of "stretched" along the longer ones. Assuming the numbers show the order in the particular list and the lists do not intersect:

1_____2_____3_____4_____5_____6_____7_____8_____9_____10_____11_____12_____13
1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30

So the order will be

By that time, you will have seen 7 words from the list based on the unimportant corpus and 3 words from the list based on the important corpus.

Is there a good way to change the balance?

Rct567 commented 1 month ago

If you are referring to 2 frequency lists from different languages, then yes, it's kind of "stretched".

The score derived from the word frequency lists for a 'language' currently works like this:

aaa => (3/3)
bbb => (2/3) => 0.666
ccc => (1/3)

aaa
bbb => (4/5) => 0.8
ccc => (3/5)
ddd
eee

aaa
bbb => (6/7) => 0.85
ccc
ddd => (4/7)
eee
fff
ggg

aaa
bbb => (8/9) => 0.88
ccc
ddd
eee => (5/9)
fff
ggg
hhh
iii
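
The pattern in these examples, where a word's value is (list length - position + 1) / list length, can be sketched as a small Python function. This is an illustration of the numbers above, not FrequencyMan's actual code:

```python
def relative_word_value(position: int, list_length: int) -> float:
    """Value of the word at 1-based `position` in a list of `list_length` words."""
    return (list_length - position + 1) / list_length

# Reproduces the examples above for "bbb" (always position 2):
print(relative_word_value(2, 3))  # 0.666...
print(relative_word_value(2, 5))  # 0.8
print(relative_word_value(2, 7))  # 0.857...
print(relative_word_value(2, 9))  # 0.888...
```

So the same word gains value as the list it sits in gets longer, which is the "stretching" effect.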

I guess the original assumption was that at least one word frequency list (in a language data folder) would reflect the whole language, and languages might have different amounts of words/inflections. If this is not the case in your situation, then that would be a way to balance it (have full word frequency lists for every language, besides the other word frequency lists you might be using).

If you are referring to how multiple word frequency lists for a language are combined, then it should not be "stretched" at all, as it is just based on the top position found:

list_a.txt:
aaaa
bbbb
cccc
dddd
eeee
ffff
gggg
hhhh

list_b.txt:
cccc
aaaa
bbbb
dddd
eeee
ffff
gggg
hhhh

result:
cccc: 1.0
aaaa: 1.0
bbbb: 0.875
dddd: 0.625
eeee: 0.5
ffff: 0.375
gggg: 0.25
hhhh: 0.125

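
If I understand the description right, the combine step could be sketched like this (a hypothetical helper, not FrequencyMan's actual implementation), assuming each word's best position across all lists is converted with the same (length - position + 1) / length formula as above:

```python
def combine_lists(*lists: list[str]) -> dict[str, float]:
    """Score each word by the best (topmost) 1-based position it has in any list."""
    best_position: dict[str, int] = {}
    for word_list in lists:
        for position, word in enumerate(word_list, start=1):
            if word not in best_position or position < best_position[word]:
                best_position[word] = position
    length = max(len(word_list) for word_list in lists)
    return {word: (length - pos + 1) / length for word, pos in best_position.items()}

list_a = ["aaaa", "bbbb", "cccc", "dddd", "eeee", "ffff", "gggg", "hhhh"]
list_b = ["cccc", "aaaa", "bbbb", "dddd", "eeee", "ffff", "gggg", "hhhh"]

# combine_lists(list_a, list_b) reproduces the result above:
# cccc -> 1.0, aaaa -> 1.0, bbbb -> 0.875, dddd -> 0.625, ... hhhh -> 0.125
```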
aleksejrs commented 1 month ago

I am too tired to re-read this, so you can skip to the last quote or even the last paragraph. Everything above that is how I use frequency lists and why.

With MorphMan and AnkiMorphs, I used what AM calls a study plan: each file gets a frequency list, then the lists are concatenated in the order of file paths/names. However, I doubt that's good for the exposure feature.

Before I learned that MorphMan behaved that way, I tried prioritizing by creating multiple symlinks to directories according to their priority, so the list generator would see multiple copies of the important files in their directory, and count their morphs multiple times. IIRC, that doesn't work with AM, and I didn't try bothering the dev who is probably using Windows.

Now, with AM's priority list (= frequency list) generator, I do something like this:

corpus/
  en/
  en/en_sometopic/
  es/
  ru/
  mix/

Each of those directories contains a directory tree: nonimp_urg/imp_nonurg/imp_urg/veryimp. I put the least important files (books, articles, songs, subtitles) directly into the language directory, and the most important ones into the veryimp directory. Then I can generate frequency lists for "veryimp", which only contains the words from the most important texts, and for "imp_urg", which contains the words from the important-urgent directory including the "veryimp" directory.

Then I put the frequency lists into directories similar to the top of the corpus tree above. Because I want "en_sometopic" to affect general "en" more than just files in a non-important directory (because it has its own tree, so that I could generate priority lists for it), I use union mounts to make "lang_data/en" show the priority files from "en", "en_sometopic" and "mix", while "lang_data/en_sometopic" is just a symlink.

If you are referring to how multiple word frequency lists for a language are combined, then it should not be "stretched" at all, as it is just based on the top position found:

But then if the short list's word #2 is less frequent in the long list than the long list's word #4, the long list's words #2, #3, #4 will have a higher priority. If the most important source is very short, its list could be meaningless for some decks.

A very long book I am reading has its own priority list, and so does a long book I have read before. AM and MM allow limiting the list to some most frequent words, but that's not good enough if I want to prioritize a very short list and still have rare words sorted in some decks.

One way could be to duplicate the contents of a short list to make the actual words in it appear higher. It seems that FM has no problem with that: it does not skip duplicates, so they count toward the list length, as long as each word matches the acceptable length and regular expression. But it doesn't sound like a good way.
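
Under the "top position found" behaviour described above, duplicating a short list would indeed push its words' values up: each word's topmost position stays the same while the list length grows. A hypothetical illustration, assuming the relative (length - position + 1) / length scoring:

```python
def value(position: int, length: int) -> float:
    # Relative value of a word at a 1-based position in a list of `length` words.
    return (length - position + 1) / length

short_list = ["xxx", "yyy", "zzz"]  # hypothetical 3-word priority list

# "yyy" sits at position 2 in the original 3-word list:
print(value(2, len(short_list)))    # 2/3, about 0.667

# Duplicating the list keeps "yyy"'s topmost position at 2,
# but doubles the length, pushing its value up:
doubled = short_list * 2
print(value(2, len(doubled)))       # 5/6, about 0.833
```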

Rct567 commented 1 month ago

The latest version on the master branch now has an option to make the word frequency values static (not dependent on the list's length).

You just need to change the default value of the parameter absolute_values from False to True:

https://github.com/Rct567/FrequencyMan/blob/64142472bfb2da181e90dce29267d46867f41162/frequencyman/lib/utilities.py#L189

If you want to test it, you can just download a copy of FrequencyMan from the master branch and put it in the plugin directory.

It will affect the values for word frequency, most_obscure_word and lexical_underexposure. It might affect other things as well. I also haven't calibrated the default ranking weight for any changes it will cause.

I don't know if it addresses the original issue you reported, but I think in theory at least it should work better. Position 42 in a word frequency list should have the same value independent of the length.
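
To illustrate the difference (the exact formulas are FrequencyMan internals I haven't checked, so treat both functions as hypothetical): a relative value depends on the list's length, while an absolute value is a function of position alone.

```python
def relative_value(position: int, length: int) -> float:
    # Length-dependent: the same position scores differently
    # in lists of different sizes.
    return (length - position + 1) / length

def absolute_value(position: int) -> float:
    # Hypothetical length-independent value: depends on position only.
    return 1 / position

print(relative_value(42, 1_000))    # 0.959
print(relative_value(42, 100_000))  # 0.99959
print(absolute_value(42))           # identical for any list length
```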

aleksejrs commented 1 month ago

Thanks, it's probably better. I haven't tested it yet.