Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Keeping Tatoeba data up to date requires a lot of unnecessary downloading #2680

Open LBeaudoux opened 3 years ago

LBeaudoux commented 3 years ago

Problem

Some developers want to keep the most complete and recent copy of Tatoeba data on their devices. Currently, the best they can do is download the export files they need on a weekly basis.

Each time, they have to download the whole dataset all over again, even though in most cases only a small proportion of the data has been added, modified or removed between two updates. This unnecessary downloading can be particularly problematic in mobile environments that are subject to data quotas. Also, once the download is complete, quickly identifying the differences with the previous version may require too much memory for some devices.

Proposed solution

In this post, @jiru has already highlighted an enhancement of the exports that could help with this problem: for each file exported, another file containing only the differences to the previous version could be generated and made available on the Tatoeba server.

The easiest way to build these "diff files" may be to use a command-line tool that compares successive CSV files to each other. However, such an operation easily consumes several GB of memory when applied to large files, and this issue would only get worse as the Tatoeba corpus grows.

Performing the comparison directly in MariaDB would surely be more efficient in terms of memory. Every week, we could copy the results of the export queries into newly created tables in the database. Two versions of the same export table could then be compared with joins on primary keys, and the differences found saved as a CSV or JSON file. For a week, the latest export tables could conveniently be used to generate custom parallel corpora. Finally, after being compared to a newer version of themselves, they would become useless and could be deleted.
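To make the idea more concrete, here is a rough sketch of what such a join-based comparison could look like, assuming hypothetical weekly snapshot tables `exports.sentences_prev` and `exports.sentences_curr` keyed on the sentence id (table and column names are only illustrative):

  -- Added or modified rows: present in the new snapshot with no identical
  -- counterpart in the old one (<=> is the NULL-safe equality operator).
  SELECT curr.id, curr.lang, curr.text
  INTO OUTFILE '/tmp/sentences_added_or_changed.csv'
  FIELDS TERMINATED BY '\t'
  FROM exports.sentences_curr AS curr
  LEFT JOIN exports.sentences_prev AS prev
         ON prev.id = curr.id
        AND prev.lang <=> curr.lang
        AND prev.text <=> curr.text
  WHERE prev.id IS NULL;

  -- Removed rows: ids present in the old snapshot but not in the new one.
  SELECT prev.id
  INTO OUTFILE '/tmp/sentences_removed.csv'
  FROM exports.sentences_prev AS prev
  LEFT JOIN exports.sentences_curr AS curr ON curr.id = prev.id
  WHERE curr.id IS NULL;

Since the joins are resolved inside the database engine, the peak memory footprint should stay modest even as the corpus grows.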

mramosch commented 3 years ago

While I acknowledge that there are (even for mobile devices) many possible use cases for these bulk downloads, it is IMHO still important to mention (yet another time) the importance and the benefits of a proper API, even in the context of this thread.

For example, I have added an ‘Extended native speaker quality evaluation system’ for sentences to my app. It needs the most recent data in order not to waste manpower: native contributors keep evaluating sentences that have already received enough evaluation input to appear as 100% ‘curated’, but that simply don’t show up as such because the update comes only once a week.

By that time, because of this possible ‘over-evaluation’ of some sentences, an enormous number of contributions would have been submitted in vain, when they could instead have been used to bring other, not yet sufficiently evaluated, sentences to a higher curation level.

Relying on huge file downloads that have to be checked, sanitized, organized, pushed over and stored in their entirety on every user’s device is a really bad choice for that use case. In the end, only real-time access to this data could solve the disk-space, mobile data quota and out-of-date problems, with all their consequences...

LBeaudoux commented 3 years ago

It seems to me that we are dealing with two types of scenarios here.

'Hot differences' scenario

The third party application needs to interact tightly and almost in real time with Tatoeba. The communication is conducted through a REST API. To update itself, the application queries the sentences, links, tags, users, audios, etc. that have been added, modified or removed since a given datetime.

For these requests to work properly, every change made to a database record would have to be timestamped somewhere (and easily accessible). It doesn't seem to me that this is the case right now. Besides, in order to keep response sizes below a few megabytes, the period of time between updates should not be too long: one week may be a suitable maximum.
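Just to illustrate what the API backend would need to run, assuming (purely hypothetically) that each exported table carried an indexed `modified` timestamp column:

  -- Illustrative only: no such `modified` column exists today.
  SELECT id, lang, text, modified
  FROM sentences
  WHERE modified > '2021-06-05 12:00:00'  -- datetime of the client's last sync
  ORDER BY modified;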

'Cold differences' scenario

The third party application doesn't need the latest contributions made in Tatoeba but still wants to take advantage of the improvements of the corpus. For this, it automatically triggers updates once a week. The first time, the complete export files are downloaded. The following times, only the packages containing the incremental differences of the missed weeks are downloaded.

To make this possible on the server side, a copy of the export views could be saved each week into a dedicated database. Note that if the changes in the tables have not all been timestamped, it is still possible to deduce them by comparing the tables from one week to the next.

The 'export database' would then be optimized to generate:

- the complete weekly export files;
- the incremental diff files between two consecutive weekly versions;
- custom parallel corpora on demand.
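A minimal sketch of the weekly snapshot step described above, assuming a dedicated `exports` database and an export view named `sentences_export` (both names are hypothetical):

  -- Materialize this week's export view into a dated snapshot table.
  CREATE TABLE exports.sentences_2021_06_12 AS
  SELECT id, lang, text
  FROM tatoeba.sentences_export;

  -- Once it has been diffed against the following week's snapshot,
  -- last week's table is no longer needed and can be dropped.
  DROP TABLE exports.sentences_2021_06_05;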

jiru commented 3 years ago

While I do think we need to solve this issue, I’d rather start implementing a solution that answers an actual real-world use case instead of just thinking about "some third-party application" that "might need" this or that. Otherwise I am afraid we are going to implement something that works but will be of little use in practice. In other words, start simple and don’t think too big.

@LBeaudoux If you are creating a third-party application, please share your use case with us. For example, after you downloaded the CSV files, do you convert the data into a different format?

Such an issue has certainly been tackled by other projects. The first that comes to my mind is the OSMAnd app. That app downloads large chunks of OpenStreetMap data to display offline maps of the world. You can basically update the maps manually every now and then by downloading the new stuff (similar to your 'Cold differences' scenario). There is also a feature called "OSMAnd Live" that keeps your local database constantly up to date by subscribing to a feed of updates (similar to your 'Hot differences' scenario).

For this to work, I have the feeling that we’d need to change the way we store our data under the hood, to make it more "transactional", similar to mysql’s binlog or git revisions. Every single change to the corpus should have a "revision number", and you should be able to go to a newer revision by applying the log of changes to your own copy of the corpus, from an arbitrary revision number. While we do have a log of changes (the contributions table), not only does it not record all the changes, but it also contains a lot of inconsistencies because of crashes or manual edits of the mysql database. Even without bugs and crashes, relying on mysql makes the whole system error-prone because we can break it just by running some SQL queries. Instead, we’d need a database system with a revision feature built-in.
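One way to picture this, with entirely made-up names, is an append-only change-log table keyed by a global revision number, which clients replay from the last revision they have seen:

  -- Illustrative sketch only, not an existing table.
  CREATE TABLE corpus_changes (
    revision    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    changed_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    table_name  VARCHAR(64) NOT NULL,   -- e.g. 'sentences', 'sentences_translations'
    row_id      INT UNSIGNED NOT NULL,  -- primary key of the affected row (simplified to a single id)
    operation   ENUM('insert', 'update', 'delete') NOT NULL,
    payload     JSON                    -- new column values, NULL for deletes
  );

  -- A client currently at revision 1200000 catches up by replaying newer changes:
  SELECT revision, table_name, row_id, operation, payload
  FROM corpus_changes
  WHERE revision > 1200000
  ORDER BY revision;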

Related: #1633.

mramosch commented 3 years ago

@jiru: While I do think we need to solve this issue, I’d rather start implementing a solution that answers an actual real-world use case instead of just thinking about "some third-party application" that "might need" this or that. Otherwise I am afraid we are going to implement something that works but will be of little use in practice. In other words, start simple and don’t think too big.

I’ll post my comment in a new thread in order not to hijack this one here...

LBeaudoux commented 3 years ago

If you are creating a third-party application, please share your use case with us.

I don't have any particular project in progress at the moment, but I will try to illustrate some real (or at least plausible) examples.

A flashcard app could enjoy incremental data releases

Let's say that as a developer, I have built a mobile flashcard app. When a user wants to review a word, he adds it to his vocabulary list and flashcards of good Tatoeba sentence pairs containing this word are generated automatically. The app works offline thanks to a local SQLite database storing all sentence pairs above a minimum quality score. As of today, a data update would consist in (re)downloading all sentences and links related to a language pair. Then, all sentence pairs would have to be (re)scored to discard low-quality pairs. By comparing new export files with their former versions to identify differences, the process can be optimized at the cost of a larger peak memory footprint. But ideally, only necessary data should be downloaded, processed and stored on the user device. This is why providing incremental data releases would be a significant improvement.

Tatominer could mirror Tatoeba in real-time thanks to a transaction feed

As a developer, I launched Tatominer to highlight the vocabulary searched on Tatoeba that is poorly covered. For now, when a user adds a sentence on Tatoeba containing a word listed by Tatominer, his contribution is not immediately taken into account by Tatominer. One contributor noted that immediately seeing the impact of his contributions would help him keep track of where he is, and also give him a feeling of accomplishment. Providing a feed of the transactions made on the Tatoeba database minute by minute would enable Tatominer to track these new contributions more closely.

That app downloads large chunks of OpenStreetMap data to display offline maps of the world.

OpenStreetMap publishes daily, hourly and even minutely diff files. Tools like Osmium have been developed to process these diff files. This could indeed be a source of inspiration for us.

I have the feeling that we’d need to change the way we store our data under the hood, to make it more "transactional", similar to mysql’s binlog

I stumbled upon a daemon that reads MySQL binlogs and writes row updates as JSON that can later be used to replicate the source database. It seems to me that this tool could help us pipe a stream of transaction events into diff files. This transaction stream could also serve as a base for new Tatoeba features (e.g. custom news feeds).

jiru commented 3 years ago

About the flashcard app, I can imagine the app would initially download CSV files, insert the sentences into its database and remove the CSV files (to save disk space). Then, to update its database, the app would need to get a list of new insertions, removals and updates of sentence texts, and a list of insertions and removals of links.
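To make the consumer side concrete, here is a sketch of how the app might apply such a diff to its local SQLite database, assuming the downloaded diff has first been loaded into staging tables (`diff_sentences_upserted`, `diff_sentences_removed`, `diff_links_added` and `diff_links_removed` are made-up names):

  -- Applied inside one transaction so that a failed update leaves the
  -- local database in its previous consistent state.
  BEGIN;

  -- 1. Remove deleted links and sentences.
  DELETE FROM links
  WHERE (sentence_id, translation_id) IN
        (SELECT sentence_id, translation_id FROM diff_links_removed);
  DELETE FROM sentences
  WHERE id IN (SELECT id FROM diff_sentences_removed);

  -- 2. Insert new sentences and update modified sentence texts.
  INSERT INTO sentences (id, lang, text)
  SELECT id, lang, text FROM diff_sentences_upserted
  WHERE true  -- avoids SQLite's INSERT...SELECT/ON parsing ambiguity
  ON CONFLICT(id) DO UPDATE SET lang = excluded.lang, text = excluded.text;

  -- 3. Insert new links.
  INSERT OR IGNORE INTO links (sentence_id, translation_id)
  SELECT sentence_id, translation_id FROM diff_links_added;

  COMMIT;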

About Tatominer, maybe it could benefit from Manticore’s percolate queries (we’d have to switch from plain indexes to real-time indexes though). If we somehow serve percolate query results through an API, Tatominer could receive only the new sentences that match the watched keywords.

jiru commented 3 years ago

Maybe we could use the system-versioned tables feature of mariadb.

LBeaudoux commented 3 years ago

Maybe we could use the system-versioned tables feature of mariadb.

It seems possible to compare two weekly versions of a system-versioned table within MariaDB. But exporting diff files with this method would be heavy, especially for shorter time periods (e.g. hourly or minutely diffs).
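For reference, a quick sketch of what system versioning would look like (illustrative only; enabling it on the real tables would have storage implications of its own):

  -- MariaDB 10.3+: keep every row version of the table.
  ALTER TABLE sentences ADD SYSTEM VERSIONING;

  -- The table as it existed one week ago...
  SELECT id, lang, text
  FROM sentences FOR SYSTEM_TIME AS OF TIMESTAMP '2021-06-05 00:00:00';

  -- ...or every row version that existed at some point during that week.
  SELECT id, lang, text, ROW_START, ROW_END
  FROM sentences
  FOR SYSTEM_TIME BETWEEN TIMESTAMP '2021-06-05 00:00:00'
                      AND TIMESTAMP '2021-06-12 00:00:00';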

The more I think about it, the more I am convinced that we should be inspired by the OpenStreetMap example and periodically publish the Tatoeba transaction history as JSON files. This approach would cover more use cases and scale better than the other change data capture (CDC) options we have discussed so far.

Obviously, this would be an ambitious goal that would need to be broken down into several milestones. A first step could be to generate a daily transaction file to experiment with Maxwell's Daemon.

This piece of software can read binlogs on the fly and write row updates as JSON (no migration required):

  bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' --producer=stdout > today_transactions.json

For example, when a new translation is added:

  mysql> INSERT INTO `tatoeba`.`sentences_translations` SET sentence_id = 269101, translation_id = 9965114;
  maxwell: {
    "database": "tatoeba",
    "table": "sentences_translations",
    "type": "insert",
    "ts": 1449786310,
    "xid": 940752,
    "commit": true,
    "data": { "sentence_id": 269101, "translation_id": 9965114}
  }
jiru commented 3 years ago

I second your suggestion.

By the way, the weekly CSV files are mostly consistent with one another, but not exactly. For example, if a new translation is added while sentences.csv is being generated, the new sentence won’t show up in it because it was added after the generation started, but I believe it will show up in links.csv because that file is generated later on. This happens because the files are not generated in a single transaction. It’s probably easy to fix with a BEGIN/COMMIT though.
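For the record, a sketch of the kind of wrapper that should be enough, assuming the exports run as plain SELECT ... INTO OUTFILE queries on InnoDB tables (column names are illustrative):

  -- One REPEATABLE READ transaction makes every export query see the same
  -- point-in-time view of the database.
  START TRANSACTION WITH CONSISTENT SNAPSHOT;

  SELECT id, lang, text
  INTO OUTFILE '/exports/sentences.csv' FIELDS TERMINATED BY '\t'
  FROM sentences;

  SELECT sentence_id, translation_id
  INTO OUTFILE '/exports/links.csv' FIELDS TERMINATED BY '\t'
  FROM sentences_translations;

  COMMIT;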