Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Anki: enriching file with example sentences #1168

Open dispyfree opened 8 years ago

dispyfree commented 8 years ago

Similar to issue #1166, this feature helps users by providing them with lists of example sentences. However, instead of just displaying a list of sentences for a given word, it works on entire uploadable Anki files and enriches them semi-automatically with example sentences.

Anki is a commonly used vocabulary trainer; it is available online, for many operating systems as well as cell phones. Please refer to http://ankisrs.net/.

Many people use Anki; they either use predefined lists (so-called "shared decks") or enter the vocabulary on their own. As those vocabulary lists quickly grow to several thousand entries, it is very cumbersome to add example sentences for individual entries. Furthermore, most people will not feel inclined to add example sentences to existing decks with several thousand entries.

Step 1: Uploading

The user uploads an arbitrary Anki-style file to the website. Furthermore, he specifies the source column (i.e. the column which contains the word for which example sentences are to be found) as well as the target column (possibly a new one; if an existing one is chosen, the existing sentences will be preserved and only overwritten if the user chooses a new sentence). All other columns are not of further interest and might not be shown in the interface at all. They are stored so that they can be exported back to Anki.
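A minimal sketch of the upload step, assuming the uploaded file is tab-separated text with one entry per line. The function name `load_deck` and its parameters are hypothetical and only illustrate how the source and target columns could be handled while all other columns are kept untouched.

```python
import csv
import io


def load_deck(text, source_col, target_col, delimiter="\t"):
    """Parse an Anki-style text file into a list of rows (lists of fields).

    Hypothetical helper: if target_col lies beyond the last column, empty
    fields are appended so that the column exists. Existing values in the
    target column are preserved.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        if source_col >= len(fields):
            raise ValueError("source column out of range")
        # Pad the row so that the target column can be filled in later.
        while len(fields) <= target_col:
            fields.append("")
        rows.append(fields)
    return rows


# Two entries, source word in column 0, new target column 2.
deck = load_deck("Haus\thouse\nBaum\ttree\n", source_col=0, target_col=2)
```

All columns stay in the row, so nothing is lost when the deck is written back out later.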

Please note that two destination columns could also be selected: one for the target language and one for the translation in the native tongue (plus transcriptions, for instance Pinyin for Chinese). First and foremost, only the target language is of interest.

Step 2: Association

After importing, each entry is automatically associated with an example sentence in the target language. This association could be done by several weightings:

After the automatic association, the user is able to override each automatically associated sentence. Furthermore, he might request sentences for words which do not have any associated sentence yet (see #1166).
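The automatic association could be sketched roughly as follows. The weighting used here (prefer the shortest sentence that contains the word as a whole token) is purely an assumption chosen for illustration; the names `tokenize` and `associate` are hypothetical.

```python
import re


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())


def associate(words, sentences):
    """Map each word to one candidate sentence, or None if nothing matches.

    Assumed weighting: among all sentences containing the word as a whole
    token, pick the shortest one.
    """
    result = {}
    for word in words:
        candidates = [s for s in sentences if word.lower() in tokenize(s)]
        result[word] = min(candidates, key=len) if candidates else None
    return result


sentences = ["The house is big.", "I see a house.", "Trees are green."]
matches = associate(["house", "tree"], sentences)
# "tree" stays unmatched because only the inflected form "Trees" occurs.
```

Entries that end up with `None` are the ones for which the user would pick a sentence manually or request one.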

If the user specified a sentence himself which does not match any sentence in the database (called an "individual sentence"), this sentence can be exchanged for any sentence in the database or it can be kept.

The user does not need to alter any of the sentences themselves - changes can be done in Anki itself after exporting. Alternatively, the editing can be allowed by converting the result to an "individual sentence".

Step 3: Export

Finally, the user can export the set at any time. The result is a simple file which can be directly imported into Anki. Exported lists can be exported again at any time. Anki is smart enough to merge changes to the underlying files such that the learning progress is not lost. This way, the same lists can be updated and extended over time without interference with the learning progress.
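As a rough sketch of the export step, assuming a tab-separated output file and a deck held as a list of rows: the function `export_deck` and the `chosen` mapping are hypothetical names. Keeping the first field of every row unchanged matters under the assumption that Anki matches existing notes on import by their first field.

```python
import csv
import io


def export_deck(rows, target_col, chosen):
    """Serialize rows to tab-separated text.

    `chosen` maps a row index to the sentence selected for that row; rows
    without an entry keep whatever the target column already contains.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for i, fields in enumerate(rows):
        fields = list(fields)  # copy, so the caller's deck is not modified
        if i in chosen:
            fields[target_col] = chosen[i]
        writer.writerow(fields)
    return out.getvalue()


deck = [["Haus", "house", ""], ["Baum", "tree", "old sentence"]]
text = export_deck(deck, target_col=2, chosen={0: "Das Haus ist groß."})
```

Because the function does not alter the deck it is given, the same deck can be exported again after further changes.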

This feature could be integrated seamlessly with #1166 by treating individual entries as vocabulary items. However, we should then introduce groups for items, as otherwise decks will mingle in the interface. As this is done automatically, several thousand items will be created, which will bloat the interface if not sorted by groups.

Anki uses simple CSV files. Therefore, most other tools should be able to process those files (possibly after using another, very simple converter, even Excel).

First of all I'd like to hear your opinion on this idea. If the idea is considered interesting I can also make some suggestions/sketches for the interface. I have been thinking about that for quite some time.

I realize that this feature will incur a huge workload; I am willing to take it up.

ckjpn commented 8 years ago

I wonder if it might not be better, and perhaps easier, to create a Python script (or a script in another language) that can import an Anki file, access the exported sentences.csv file, and grab the sentences locally on the user's computer.

See the downloads page to grab the necessary file(s).
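A local script along these lines might start by loading the sentences of one language. This sketch assumes that sentences.csv is tab-separated with the columns id, language code, and text; the function name `load_sentences` is hypothetical.

```python
import csv
import io


def load_sentences(lines, lang):
    """Return the text of all sentences in the given language.

    `lines` is any iterable of lines, e.g. an open file. Quoting is
    disabled so that quotation marks inside sentences are kept literally.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[2] for row in reader if len(row) >= 3 and row[1] == lang]


# In-memory stand-in for the downloaded file.
sample = io.StringIO(
    "1\teng\tI went home.\n"
    "2\tdeu\tIch ging nach Hause.\n"
    '3\teng\tHe said "go".\n'
)
english = load_sentences(sample, "eng")
```

With the real file, the call would look like `load_sentences(open("sentences.csv", encoding="utf-8", newline=""), "eng")`.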

dispyfree commented 8 years ago

I can think of those drawbacks:

My strongest point is the very first one; I do share your concern for the load. Would you as a user expect a GUI for such an application? Especially when choosing between different possibilities for one entry, a GUI should come in handy.

trang commented 8 years ago

From my understanding, each of the steps mentioned in the issue description can be considered as 3 independent features.

  1. The possibility for users to mass import vocabulary. It can be useful even for non-Anki users. For instance @ckjpn mentioned to me today that he would like to import frequency lists.
  2. The possibility to associate specific sentences to a vocabulary item. This is already part of the plan for the vocabulary list feature.
  3. The possibility to export vocabulary lists (and associated sentences) in a format that can be imported into Anki or other SRS apps. This can be useful simply for backup purposes, for people who are worried about losing their data, or for offline usage, for people who have unstable internet connections.

Allowing users to import and export will indeed put more load on the server, but the fact that it puts more load on the server is not a reason to discard the idea. It would be very beneficial for users if Tatoeba could provide these features.

It will not be easy to implement but if @dispyfree is up for the challenge, I don't see any problem.

@dispyfree, if/when you are ready to start working on this let me know. I will create separate issues for the import and the export functionalities.

jayrod commented 8 years ago

I've started working on something very similar to this process. https://github.com/jayrod/fluentanki/tree/working

My "working" branch has a Python script that is doing some of the stuff you're looking to do. I just started on a feature that would auto download the Tatoeba db and pull in example sentences. I'm open to help and suggestions.

jiru commented 8 years ago

I’m all for such a feature. How about directly interfacing with AnkiWeb instead of manual export/import? This would allow adding directly from Tatoeba into your Anki deck. I’m not sure about the feasibility, just suggesting the idea.

dispyfree commented 8 years ago

I thought of another issue which can only be tackled by a local application.

Apart from a few languages (for instance Chinese), verbs and nouns need conjugation or declension to adapt to other parts of the sentence. Those transformations can alter the original stems drastically (cf. weak/strong verbs), the result being that if you search for a verb, its conjugated forms are very likely not to show up. One could argue that the most important tenses (most likely including the present tense) will be somewhat close to the stem and therefore be found by a simple heuristic (one could just cut off the last few characters). On the other hand, this restriction limits the usefulness of this feature greatly.
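The "cut off the last few characters" heuristic can be sketched as below. The function name `prefix_match` and the parameters `cut` and `min_stem` are hypothetical, and the numbers are arbitrary assumptions rather than tuned values.

```python
import re


def prefix_match(word, sentence, cut=2, min_stem=3):
    """Check whether any token in the sentence starts with the truncated word.

    The last `cut` characters are dropped, unless that would leave fewer
    than `min_stem` characters, in which case the word is used unchanged.
    """
    stem = word.lower()
    if len(stem) - cut >= min_stem:
        stem = stem[:-cut]
    tokens = re.findall(r"\w+", sentence.lower())
    return any(token.startswith(stem) for token in tokens)


regular = prefix_match("machen", "Er macht das.")          # "mach" matches "macht"
irregular = prefix_match("gehen", "Er ging nach Hause.")   # "geh" does not match "ging"
```

The second call illustrates the limitation described above: forms whose stem changes are missed by a pure prefix comparison.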

There is a workaround, however. Linguists have worked hard to create models which are essentially able to break down arbitrary words into their stems. Those tools, however, require a lot of resources and should NOT be run on a server. To back up my claim, the tool I have in mind needs at least 500MB as the baseline. Please note that those tools are only available for a few major languages (i.e. English, French, German, Spanish, Chinese and that's about it).

On the other hand, this processing could also be done when inserting the phrases into Tatoeba in the first place. Without it, when you search for "go", you will not find sentences which use "went". This would also allow for queries by tense: if you are not sure how a verb is used in the past tense, you could use its stem plus the filter "past tense". This could actually speed up the search/indexing process, as there are not as many stems as conjugated forms. However, I am not sure how users would feel about that; most might expect only exact hits.
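To make the idea of indexing by stem at insertion time concrete, here is a toy sketch. The `LEMMAS` table is a hand-written, hypothetical stand-in for a real stemmer or lemmatizer, and `build_index` and `search` are illustrative names, not part of Tatoeba.

```python
import re
from collections import defaultdict

# Hypothetical toy table: surface form -> (stem, tense).
LEMMAS = {
    "go": ("go", "present"),
    "goes": ("go", "present"),
    "went": ("go", "past"),
}


def build_index(sentences):
    """Build an index from stem to (tense, sentence) pairs at insertion time."""
    index = defaultdict(list)
    for sentence in sentences:
        for token in re.findall(r"\w+", sentence.lower()):
            if token in LEMMAS:
                stem, tense = LEMMAS[token]
                index[stem].append((tense, sentence))
    return index


def search(index, stem, tense=None):
    """Return sentences for a stem, optionally restricted to one tense."""
    return [s for t, s in index.get(stem, []) if tense is None or t == tense]


index = build_index(["I went home.", "She goes to school."])
all_hits = search(index, "go")           # both sentences
past_hits = search(index, "go", "past")  # only the sentence using "went"
```

The lookup works on stems only, which is why the index has fewer keys than there are inflected forms.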

This problem only affects the automatic association feature; conjugated/declined words can still be associated manually. As a programmatic solution is both complex and resource-hungry, I don't see it running on any server. However, this makes a strong argument for a local program. When considering the pros and cons, I still think that manual association can resolve this problem, as most users are much more likely to pull ready-made lists instead of associating sentences themselves. With some help from native speakers, this problem can easily be ameliorated. What do you think?

@trang: I am ready to go. I shall prepare some sketches. Is #1166 stable enough to build upon that code?

@jiru: This is an excellent idea; when the feature is finished, this can be suggested to the creator of Anki.

jiru commented 8 years ago

@dispyfree About the stemming problem, you might want to look at how Sphinx deals with it. One of the Sphinx developers’ concerns is to make it work as fast as possible with the lowest memory footprint (both during indexing and lookup). So it actually works on the server. Whenever you make a search in languages we have stemmers for, Tatoeba returns (some of the) conjugated/declined words too. On tatoeba.org, we currently make Sphinx use only libstemmer (for languages where it’s available), but there are interesting alternatives, such as lemmatizers or plain word lists.

trang commented 8 years ago

@dispyfree, I created the issue #1281 for the linking of sentences to vocabulary items. It's not complete yet, I still want to add some additional mockups, to illustrate everything. But you (and others) can already have a look at it and let me know if functionally speaking it makes sense to you.

I also created #1282 (importing) and #1283 (exporting), which also need to be completed, but that's for much later.