Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Allow sorting sentences by difficulty #3033

Open cangareijo opened 1 year ago

cangareijo commented 1 year ago

Allow sorting sentences by difficulty.

https://tatoeba.org/wall/show_message/39502#!#message_39502

DJ-Saidez commented 1 year ago

That's subjective (i.e., definitions vary, and it would likely be difficult for machines to process) and is probably better done manually through lists, although I do agree it'd be very useful for learners.

ckjpn commented 1 year ago

I think this would be very difficult to implement, and if not done well, it would give the impression that the results are more accurate than they really are. If this is going to be attempted, I think it needs to be done very carefully.

Sentences can be difficult because of unfamiliar patterns or unfamiliar vocabulary. They can also be difficult because of things like ambiguity. What's easy in a foreign language depends on one's native language and any other languages one has studied, since some patterns and some vocabulary are similar.

Some ideas, though ...

Sentences with old-fashioned or archaic language use, both vocabulary and patterns, would likely be more difficult, so they could be filtered out and added to the bottom of the list.

Sentences containing low-frequency vocabulary would be more difficult, so they could be filtered out and put at the bottom of search results. Note, though, that even some high-frequency words have low-frequency meanings, so care would need to be taken to move such sentences lower in the search results as well.

Longer sentences can sometimes be more difficult since they often use more complex sentence patterns and long phrases. These could be put at the end of search results. However, not all long sentences use complex patterns.

Sentences using phrases and words not seen or heard much in modern life would likely be more difficult, too, so sentences taken from old public domain books could be moved lower on the list. However, not all sentences from public domain books fall into this category.
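To make the vocabulary-frequency and length ideas above more concrete, here is a minimal Python sketch. It is only an illustration, not anything Tatoeba implements: the function names (`tokenize`, `build_frequencies`, `difficulty_key`, `sort_by_difficulty`) are hypothetical, word frequencies are assumed to be counted from the sentence collection itself rather than from a proper frequency list, and the simple tokenizer assumes a language that separates words with spaces. It does not address archaic usage, low-frequency meanings of common words, or the learner's native language.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into lowercased words (keeps apostrophes, e.g. "don't")."""
    return re.findall(r"\w+(?:'\w+)*", sentence.lower())


def build_frequencies(corpus):
    """Count how often each word occurs across all sentences in the corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(tokenize(sentence))
    return counts


def difficulty_key(sentence, freqs):
    """Return a sort key: sentences whose rarest word is common come first,
    and ties are broken in favor of shorter sentences."""
    words = tokenize(sentence)
    # Frequency of the rarest word; unknown words count as 0 (assumption).
    rarest = min((freqs[w] for w in words), default=0)
    return (-rarest, len(words))


def sort_by_difficulty(sentences, freqs):
    """Order sentences from (heuristically) easiest to hardest."""
    return sorted(sentences, key=lambda s: difficulty_key(s, freqs))


corpus = [
    "I like perspicacious cats.",
    "Do you like tea?",
    "I like tea.",
    "I do like you.",
]
freqs = build_frequencies(corpus)
for sentence in sort_by_difficulty(corpus, freqs):
    print(sentence)
```

Using the rarest word rather than the average frequency means a single unusual word is enough to push a sentence down the list, which matches the idea of moving such sentences to the bottom of search results. As noted above, sentence length is only a weak signal, so it is used here just as a tie-breaker.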

All that said, I think this would be very difficult to implement successfully for multiple languages. It would take hours of a researcher's time to even get it working well with one language.

I made a rough attempt to group sentences by level based on vocabulary a few years back.

CK's OGTE-Level Lists: http://www.manythings.org/tatoeba/ogte.html

You can read the "About" section on the page for more information.