LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading.
https://luteorg.github.io/lute-manual/
MIT License

A better way to sort by difficulty #453

Open quinnlas opened 4 months ago

quinnlas commented 4 months ago

Is your feature request related to a problem? Please describe.

Despite being intermediate in my TL (target language), I was having difficulty sorting the books I had uploaded to find the easiest one.

Describe the solution you'd like

I believe the best metric for book difficulty would be:

    unique unknown words / total words

measured over a short number of pages.

This reflects how often you will need to look words up while reading. It would be nice if you could sort the active book lists by this.
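
For concreteness, here is a minimal sketch of how this might be computed. The page_tokens / known_words structures are just assumptions for illustration, not Lute's actual data model:

```python
def lookup_difficulty(page_tokens, known_words, num_pages=5):
    """Unique unknown words / total words over the first num_pages pages.

    page_tokens: list of pages, each a list of lowercased word tokens.
    known_words: set of terms the user already knows.
    Both structures are hypothetical, just for illustration.
    """
    tokens = [t for page in page_tokens[:num_pages] for t in page]
    if not tokens:
        return 0.0
    unique_unknown = {t for t in tokens if t not in known_words}
    return len(unique_unknown) / len(tokens)
```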

Describe alternatives you've considered

I ran the project from source and tried several alternatives; the above was by far the most useful for finding a good book to read. Here are my thoughts on the options I tried.

Unique unknown / unique total over 5 pages (default): This doesn't seem very useful to me. I think the denominator should definitely be total words, since that takes into account the repetition of common words.

Unique unknown / unique total over the whole book: This has the same issue while also being slower. I'm not suggesting calculating stats over a large number of pages anyway, but I thought I'd mention that if a hashmap of unique words to their counts were saved for each book, any "whole book" stat would be much quicker to calculate (rough sketch below). So while this method was slow with a basic implementation, it wouldn't need to be with a better one. Still, not that useful.
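
For what it's worth, the cached-counts idea could be as simple as a collections.Counter per book. A sketch under assumed data structures, not Lute's actual schema:

```python
from collections import Counter

def build_word_counts(pages):
    """Build the per-book cache once (e.g. on import or edit), then reuse it.

    pages: list of pages, each a list of word tokens (hypothetical shape).
    """
    counts = Counter()
    for page in pages:
        counts.update(page)
    return counts

def whole_book_lookup_percent(counts, known_words):
    """Unique unknown / total words, computed from the cached counts."""
    total = sum(counts.values())
    unique_unknown = sum(1 for word in counts if word not in known_words)
    return unique_unknown / total if total else 0.0
```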

Total unknown / total over the whole book: The issue with "total unknown" is that it doesn't tell you how often the unknown words are repeated, which in turn means you don't know how many lookups you'll have to do. For example, I had a short Wikipedia article with 54 new-word occurrences, 45 of them unique, in only 110 total words, meaning a very large percentage of the words would need to be looked up. But you could have a book where the new words are repeated often, so you only need to look each one up once. That would be far easier, yet it would get the same calculated difficulty.

Unique unknown / total over the whole book: This does give you the "lookup percent" across the whole book. The downside is that it heavily favors very long books; in my case, the TL translation of War and Peace, which is obviously not a good choice for an intermediate learner. Given a long enough book, almost all words will be repeated, which brings down the calculated difficulty. But we really just want to know how difficult the book will be immediately, when you start reading it.

Unique unknown / total over 5 pages (winner): This works really well in my experience. I'm not sure of the ideal number of pages, but you would want it to be enough to compensate for progress on the current page, since that progress will bring the number down. Another idea would be to skip the current page entirely and use the next X pages (see the sketch below).

Too low an X value won't account for variation in difficulty between pages, but too high a value starts to cause the issues of the previous method. In any case, 5 seemed to work well for me.

This metric doesn't account for variation in difficulty between, say, different chapters of a book (or on any pages it didn't consider). But books tend not to vary that much, so I think it's OK.
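
Here is a sketch of the "skip the current page, use the next X" variant mentioned above, using the same assumed structures as the earlier sketch:

```python
def next_pages_difficulty(page_tokens, known_words, current_page, x=5):
    """Unique unknown / total words over the x pages after the current
    one, so progress on the page being read doesn't skew the number.
    Near the end of a book this naturally falls back to fewer pages.
    """
    window = page_tokens[current_page + 1 : current_page + 1 + x]
    tokens = [t for page in window for t in page]
    if not tokens:
        return 0.0
    unique_unknown = {t for t in tokens if t not in known_words}
    return len(unique_unknown) / len(tokens)
```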

jzohrab commented 3 months ago

@quinnlas - a belated thanks for this issue, which dropped off my radar. Good analysis too; it's hard to find the right metric for something like this, especially since it combines a few dimensions (X unique items that can each repeat Y times).

If the next 5 pages had 1000 words, with 2 unknown words repeated 10 times each, say, and the rest of the text consisting of 100 repeated known words:

unique unknown / total over 5 pages = 2 / 1000 = 0.002
unique unknown / unique total over 5 pages = 2 / 102 ≈ 0.02
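
As a quick sanity check of those numbers in code (illustrative only):

```python
unique_unknown = 2
total_words = 1000        # 2 unknown words * 10 occurrences + 980 known
unique_total = 100 + 2    # 100 distinct known words + the 2 unknown

print(unique_unknown / total_words)   # 0.002
print(unique_unknown / unique_total)  # ~0.0196, i.e. about 0.02
```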

I think the current code actually renders the pages and then does its calculations based on the statuses of the words it sees. This is important for character-based languages, because the characters get combined into multi-word terms. But your logic should still be fine.
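
A rough sketch of what a status-based calculation might look like; the (text, status) token shape and the status value here are just assumptions for illustration, not the actual internals:

```python
UNKNOWN = 0  # assumed status value for unknown terms, for illustration

def page_difficulty(rendered_tokens):
    """rendered_tokens: hypothetical list of (term_text, status) pairs
    produced after rendering, i.e. after characters have been combined
    into multi-word terms.
    """
    total = len(rendered_tokens)
    if total == 0:
        return 0.0
    unique_unknown = {text for text, status in rendered_tokens
                      if status == UNKNOWN}
    return len(unique_unknown) / total
```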