LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading. Python/Flask.

MIT License

416 stars 46 forks

Add Unique Words stat to Word Count #180

Open M-Biggles opened 7 months ago

M-Biggles commented 7 months ago

As the title says, it would be good to have the number of unique words alongside the total words on the home page under Word Count.

jzohrab commented 7 months ago

This might be tough to manage, processing-wise.

Do you feel it adds value to you as a learner, or would it become "yet another distraction", in a way? (E.g., when I was a kid reading a difficult book, I didn't know that it had 9700 unique words, I just knew it was a big book :-P ).

jzohrab commented 7 months ago

(Recognizing that adult learner needs are different than a kid's reading needs, but I still like that mindset when considering or designing things) Cheers!

M-Biggles commented 7 months ago

Definitely adds value. I'd say the unique word count is more useful than the total for working out the difficulty level of a text. Easier texts tend to have fewer unique words, and difficult ones tend to have more (which is why we restrict the word count when creating graded materials). It really helps many learners select which text to work on next when they can't pick and are trying to set up learning goals (read some A1-A2 materials, later read the B1-B2 stuff, etc.).

Processing-wise, it could be turned on or off in the options. Just hiding the number would not switch off the processing, though, so an option that disables the processing itself is probably the better path.

"Allow Processing of Unique Words" / "Display number of Unique Words"
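For illustration, a minimal sketch of what such an option could gate. This is not Lute's code: `word_counts`, the `count_unique` flag, and the regex tokenizer are all hypothetical, and a real implementation would rely on the language's own parser rather than a regex.

```python
import re


def word_counts(text, count_unique=True):
    """Return (total, unique) word counts; unique is None when disabled.

    Hypothetical helper: the tokenizer is a naive stand-in for a real,
    language-aware parser.
    """
    # Lowercase so "The" and "the" count as one word.
    tokens = re.findall(r"\w+", text.lower())
    # Skip the set construction entirely when the option is off.
    unique = len(set(tokens)) if count_unique else None
    return len(tokens), unique


total, unique = word_counts("The cat sat. The cat ran.")
print(total, unique)
```

With the flag off, the extra work is skipped rather than merely hidden, which is the distinction between the two option labels.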

M-Biggles commented 7 months ago

An additional note: with the number of unique words combined with a word frequency list, it would be possible to work on a tool to automatically tag texts by level, such as*:

A1 = 0-600 words
A2 = 601-1,200 words
B1 = 1,201-2,500 words
B2 = 2,501-5,000 words
C1 = 5,001-10,000 words
C2 = 10,001-20,000 words

* Numbers from here, but we could look into it more; they would vary per language.

That's another feature for another day, but it would be made more possible by a calculation of unique words.
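As a rough sketch of how the bands above could be applied, assuming they are used as plain thresholds on a unique word count: `estimate_level` and the two constants are hypothetical names, and the numbers are only the illustrative ones from this comment, not validated per-language data.

```python
from bisect import bisect_left

# Inclusive upper bounds of each band, taken from the list above.
# Illustrative only: real thresholds would vary per language.
LEVEL_BOUNDS = [600, 1200, 2500, 5000, 10000, 20000]
LEVEL_NAMES = ["A1", "A2", "B1", "B2", "C1", "C2"]


def estimate_level(unique_words):
    """Map a unique word count to a level label, or None above the top band."""
    if unique_words < 0:
        raise ValueError("unique_words must be non-negative")
    # bisect_left returns the first band whose upper bound is >= the count.
    index = bisect_left(LEVEL_BOUNDS, unique_words)
    return LEVEL_NAMES[index] if index < len(LEVEL_NAMES) else None


print(estimate_level(600), estimate_level(601), estimate_level(4000))
```

Keeping the bounds in a plain list would make it straightforward to swap in per-language or user-defined thresholds later.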

jzohrab commented 7 months ago

I'd leave the tagging to the users, this could get error-prone with different languages. But someone else could implement it if they'd like :-)

M-Biggles commented 7 months ago

Yeah, I wouldn't implement without having some decent data on the levels. HSK numbers and such.

Maybe it could be a per-language thing, only being operative for preloaded languages where we have good counts for the levels and allowing users to set their own.

jzohrab commented 7 months ago

Holding off until #250 is done.

jzohrab commented 7 months ago

#250 done, not blocked now.

jzohrab commented 7 months ago

Now that #250 is done I was looking into this.

I have a "sampled text unique words" count, but it's only for the same sampled text used to calculate the book stats, i.e., it's only for the next 5 pages. I can add that quickly.

Adding a full unique word count for the whole book will be tougher as it requires a full book parse/fake render.
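To make the difference concrete, here is a minimal sketch of a sampled count, assuming a book is available as a list of page strings. `sampled_unique_count`, its parameters, and the regex tokenizer are hypothetical stand-ins, not the actual stats code.

```python
import re


def sampled_unique_count(pages, current_page, sample_size=5):
    """Count unique words in the next `sample_size` pages only.

    Hypothetical helper: `pages` is a list of page texts and
    `current_page` is a zero-based index into it.
    """
    sample = pages[current_page:current_page + sample_size]
    seen = set()
    for page_text in sample:
        # Naive tokenizer for illustration; a real one is language-aware.
        seen.update(re.findall(r"\w+", page_text.lower()))
    return len(seen)


pages = ["a b", "b c", "c d", "d e", "e f", "f g", "z"]
print(sampled_unique_count(pages, 0))
```

A whole-book count would be the same loop over every page, which is where the full parse comes in.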

M-Biggles commented 7 months ago

All sounds good to me.

A thought: would it be possible to store the unique word count of the whole-book parsing as a value somewhere? Do a full-parse at book creation and don't reparse except in case of edits, since that's the only way the value would change. With a little toggle to enable or disable full-parse at book loading.

jzohrab commented 7 months ago

Yes, it's def possible, is just a more involved request than adding the sampled uniques count. :-)

The full parse is needed at book load anyway to do pagination, so it's more a question about where to calc and store the value, and when to update it.
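One possible shape for the "store it and only recompute on edits" idea, sketched under assumptions: the cache here is an in-memory dict keyed by a hypothetical `book_id`, and edits are detected by hashing the text. Where the value would really be stored (for example a database column) is exactly the open question, so `UniqueWordCache` is illustrative only.

```python
import hashlib
import re


class UniqueWordCache:
    """Cache a whole-book unique word count until the text changes."""

    def __init__(self):
        self._cache = {}  # book_id -> (text digest, unique word count)
        self.parses = 0   # number of full parses actually performed

    def get(self, book_id, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self._cache.get(book_id)
        if cached is not None and cached[0] == digest:
            return cached[1]  # text unchanged, reuse the stored value
        # Text is new or was edited: do the (expensive) full count once.
        self.parses += 1
        count = len(set(re.findall(r"\w+", text.lower())))
        self._cache[book_id] = (digest, count)
        return count


cache = UniqueWordCache()
cache.get(1, "a b a")   # first load: parses
cache.get(1, "a b a")   # unchanged: served from the cache
cache.get(1, "a b c")   # edited: parses again
print(cache.parses)
```

If the count were taken during the parse that pagination already needs, the hashing step could be dropped and the value simply refreshed whenever the book is saved.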

M-Biggles commented 7 months ago

> Yes, it's def possible, is just a more involved request than adding the sampled uniques count. :-)
>
> The full parse is needed at book load anyway to do pagination, so it's more a question about where to calc and store the value, and when to update it.

Yeah, it's a further step beyond the 5-page sample, which is a great feature itself, but I figured it would be good to do at some point. Grabbing it from the initial parse sounds right.

M-Biggles commented 6 months ago

#303 impacts this one. Only doing the current page would help with the processing load.