Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Add More Fields (Columns) to the https://tatoeba.org/en/vocabulary/add_sentences page, and ... #2886

Open ckjpn opened 2 years ago

ckjpn commented 2 years ago

The Requested Feature

Add More Fields (Columns) to the https://tatoeba.org/en/vocabulary/add_sentences page to allow members to sort in a way that would be the most useful for them.

Additional Ideas (Just Brainstorming Possibilities...)

Related Feature Requests

LBeaudoux commented 2 years ago

@ckjpn Thank you for this overview. As it is, the "vocabulary request" feature is obviously not satisfactory and several of us have made suggestions to improve it.

But the more I think about it, the more it seems weird to ask other users to add sentences that you could easily add yourself. Instead, we should encourage learners to add sentences when a word they are interested in is not adequately covered. Native speakers would then simply proofread these sentences. This would be a win-win situation, especially for the developers who could drop this feature and focus on more useful improvements.

alanfgh commented 2 years ago

Instead, we should encourage learners to add sentences when a word they are interested in is not adequately covered. Native speakers would then simply proofread these sentences.

While this idea has its merits, I don't think it covers every scenario. Not everyone who requests sentences containing specific vocabulary is sufficiently comfortable in the language to write the sentences themselves. Some have such a low level of proficiency that any sentences they would write might not be clearly understood by proofreaders. And some languages don't even have native speakers at Tatoeba who can proofread them (or are willing to do so). There are also the problems of segregating the unproofread sentences, making native speakers aware of their existence in a way that is not as overwhelming and stagnant as what we currently do with the vocabulary requests, and finally moving them into the pool of accepted sentences. I feel that dropping the vocabulary request feature altogether does not solve these problems; it just moves them elsewhere.

LBeaudoux commented 2 years ago

Not everyone who requests sentences containing specific vocabulary is sufficiently comfortable in the language to write the sentences themselves.

I agree that it is often difficult for a learner to produce sentences from scratch. But I think a simpler and more reliable method is to start with an actual sentence and then simplify it. Following the recommendations in this guide is also very important.

There are also the problems of segregating the unproofread sentences, making native speakers aware of their existence in a way that is not as overwhelming and stagnant as what we currently do with the vocabulary requests, and finally moving them into the pool of accepted sentences.

Unlike vocabulary requests, the sentence review workflow is exactly the kind of core feature that we should focus our attention on.

I increasingly think that if we spend so much time here talking about vocabulary requests, it is not because it has potential but rather because it is obviously flawed and difficult to fix. Several years have passed, and it doesn't seem reasonable to me to keep such a frustrating feature in the production branch.

In the long run, it might be more productive to export an anonymized list of current vocabulary items each week. Those interested could then download this data, and even integrate it into third-party applications.
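
A minimal sketch of what such a weekly export could look like, not a description of how tatoeba2 stores or exports anything. The `(user_id, lang, text)` row layout, the function name `anonymized_export`, and the tab-separated output are all assumptions made for illustration. The idea is simply to drop the user identifier and keep only how many rows mention each item.

```python
import csv
import io
from collections import Counter


def anonymized_export(rows):
    """Aggregate vocabulary items without exposing who added them.

    rows: iterable of (user_id, lang, text) tuples. This schema is
    hypothetical and only stands in for the real vocabulary tables.
    Returns tab-separated text with columns: lang, text, row count.
    """
    # Count rows per (lang, text); the user_id is deliberately discarded.
    counts = Counter((lang, text) for _user_id, lang, text in rows)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Sort for a stable, diff-friendly file from week to week.
    for (lang, text), n in sorted(counts.items()):
        writer.writerow([lang, text, n])
    return out.getvalue()


if __name__ == "__main__":
    sample = [(1, "eng", "river"), (2, "eng", "river"), (2, "fra", "maison")]
    print(anonymized_export(sample), end="")
```

Note that very rare items could in principle still hint at who added them, so a real export might also want a minimum-count threshold; that is left out here.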

ckjpn commented 2 years ago

I increasingly think that if we spend so much time here talking about vocabulary requests, it is not because it has potential but rather because it is obviously flawed and difficult to fix.

I somewhat agree with this. However, if we are going to have this vocabulary request function, something should be done to make it more useful than the way it is now.

In the long run, it might be more productive to export an anonymized list of current vocabulary items each week.

I would like to see this data exported.

For languages that have had a lot of research done on the frequency of use of vocabulary items, instead of vocabulary requests or even tools like Tatominer, it would likely be more useful to supply lists of words on high-frequency lists that are not yet covered by sentence examples in the Tatoeba Corpus.
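
As a rough sketch of that idea, assuming a frequency list is available as an ordered list of words and the corpus as a list of sentences in one language: the function names and the naive tokenizer below are illustrative assumptions, not part of Tatoeba or Tatominer.

```python
import re


def tokenize(sentence):
    # Naive tokenizer: lowercased runs of word characters. This is an
    # assumption for illustration; languages written without spaces
    # (e.g. Japanese) would need a proper segmenter, and inflected
    # forms are not mapped back to a dictionary form.
    return set(re.findall(r"\w+", sentence.lower()))


def uncovered_words(frequency_list, sentences):
    """Return words from frequency_list that appear in no sentence.

    The result keeps the order of frequency_list, so the most
    frequent uncovered words come first.
    """
    covered = set()
    for sentence in sentences:
        covered |= tokenize(sentence)
    return [word for word in frequency_list if word.lower() not in covered]


if __name__ == "__main__":
    freq = ["the", "house", "river", "quickly"]
    corpus = ["The house is old.", "He ran quickly."]
    print(uncovered_words(freq, corpus))  # ['river']
```

Because the output preserves frequency order, contributors could work from the top of the list down.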

LBeaudoux commented 2 years ago

it would likely be more useful to supply lists of words on high-frequency lists that are not yet covered by sentence examples in the Tatoeba Corpus

There are indeed many ways to identify words or sentences that deserve our attention. Each approach has its pros and cons. That's probably another reason why selecting one and hard-coding it into tatoeba2 causes discontentment.

Tatoeba should just periodically release valuable signals such as search queries and bookmarked vocabulary. Interested third-party developers could then code their apps at their own pace and with the technology stack of their choice. After some feedback from the community, the 'inspirational apps' that find an audience could join Tatominer and be highlighted on Tatoeba.

I now sincerely believe that we have to help the very few tatoeba2 developers by allowing them to focus on the transactional features that only they can improve. Most things related to data analysis can just as well be done elsewhere.