Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Wiki Page Not Editable #3057

Status: Closed (ckjpn closed this issue 5 months ago)

ckjpn commented 1 year ago

Before replying to a message sent to team (at) tatoeba.org, I tried to update this page, https://en.wiki.tatoeba.org/articles/edit/using-the-tatoeba-corpus , with the following text. Clicking "save", and also trying "save and continue", did not save the text. The only part I changed is what comes before ## Processing the Tatoeba Corpus.

# Using the Tatoeba Corpus for Your Own Projects

## Terms of Use

**Creative Commons License**

Tatoeba's technical infrastructure uses the default **Creative Commons Attribution 2.0 France license (CC-BY 2.0 FR)** for the use of textual sentences. The BY mention implies a single restriction on the use, reuse, modification and distribution of the sentence: a condition of attribution. That is, using, reusing, modifying and distributing the sentence is only allowed if the name of the author is cited.

Read the complete [Terms of Use](http://tatoeba.org/eng/terms_of_use). 

**License for Audio Files**

Note that the terms of use for the **audio files** are not the same as for the text of sentences.

See the [list of audio lists](https://tatoeba.org/eng/sentences_lists/of_user/CK/audio%20-/page:1/sort:modified/direction:desc) to see the license, if any, under which these people have offered their files for use outside of tatoeba.org. You should verify these licenses by clicking "audio files" on each member's profile.

## Processing the Tatoeba Corpus

You will probably want to filter out sentences that:

* require correction or improvement
* sound unnatural
* are poor or unnatural translations of other sentences

You may also want to filter out those that:

* contain vulgar language or sexual references
* contain archaic or old-fashioned content
* are untrue
* are sexist, are racist, are insulting to others, or otherwise inappropriate for your audience
* are particularly long

You can use various forms of metadata to aid with this process:

* tags (for instance, "@change", "archaic", "vulgar"; see [Tags](http://tatoeba.org/eng/tags/view_all) for more)
* sentence ratings
* contributors' self-reported skill in the language (as indicated in their profiles). Note that several members rate themselves as native speakers of multiple languages and that self-reported levels may not be accurate.

If you are using the data to create language learning materials:

* You should probably use only sentences that you or someone else has personally proofread and not rejected, since you do not want to be teaching people errors.

Note that most sentences that do not have errors are not explicitly marked with an "OK" rating or tag, and some sentences that do have errors are not marked with a negative rating or tag. Taking all of this into account, you will probably need to perform both custom automated processing and manual review.
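The automated part of this filtering could look roughly like the sketch below. It is only an illustration under stated assumptions: the tab-separated layouts (`id`, `lang`, `text` for sentences; `sentence_id`, `tag` for tags), the sample rows, and the `MAX_LENGTH` cutoff are hypothetical and should be checked against the actual export files you download. Manual review is still needed afterwards.

```python
import csv
import io

# Hypothetical sample data. Assumed layouts (verify against the real exports):
#   sentences: id <TAB> lang <TAB> text
#   tags:      sentence_id <TAB> tag
SENTENCES_TSV = (
    "1\teng\tThis is fine.\n"
    "2\teng\tThou art late.\n"
    "3\teng\tIt are wrong.\n"
    "4\tfra\tBonjour.\n"
)
TAGS_TSV = "2\tarchaic\n3\t@change\n"

EXCLUDED_TAGS = {"@change", "archaic", "vulgar"}  # example tags named in the text
MAX_LENGTH = 100  # hypothetical cutoff for "particularly long" sentences


def load_excluded_ids(tags_file, excluded_tags):
    """Return the ids of sentences carrying at least one excluded tag."""
    reader = csv.reader(tags_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0] for row in reader if len(row) >= 2 and row[1] in excluded_tags}


def filter_sentences(sentences_file, excluded_ids, lang, max_length):
    """Keep sentences in `lang` that are untagged for exclusion and short enough."""
    reader = csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed lines
        sentence_id, sentence_lang, text = row[0], row[1], row[2]
        if (
            sentence_lang == lang
            and sentence_id not in excluded_ids
            and len(text) <= max_length
        ):
            kept.append((sentence_id, text))
    return kept


excluded = load_excluded_ids(io.StringIO(TAGS_TSV), EXCLUDED_TAGS)
kept = filter_sentences(io.StringIO(SENTENCES_TSV), excluded, "eng", MAX_LENGTH)
print(kept)
```

With real data, the `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`. Because untagged sentences are not necessarily error-free, the result is a candidate list for proofreading, not a final selection.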

## Suggestions for Those Planning to Use the Corpus

* Tell your audience how you selected the sentences.
  * See an example on this page: [www.manythings.org/bilingual](http://www.manythings.org/bilingual/)
* Since corrections are being made all the time, you should frequently update your project so your audience benefits from these corrections.

## Download the Tatoeba Corpus

[Downloads](http://tatoeba.org/eng/download_tatoeba_example_sentences) are updated every Saturday.

## FAQ

* [How do I give proper attribution?](https://en.wiki.tatoeba.org/articles/show/faq#i-would-like-to-use-tatoeba's-data-for-my-project.)
* [Where can I download Tatoeba's audio data?](https://en.wiki.tatoeba.org/articles/show/faq#where-can-i-download-tatoeba's-audio-data?)
* [How can I download all sentences and translations in specific languages?](https://en.wiki.tatoeba.org/articles/show/faq#how-can-i-download-all-sentences-and-translations-)

ckjpn commented 1 year ago

This happened on another page as well, so the bug is not specific to the page above.

I tried to edit this page, too: https://en.wiki.tatoeba.org/articles/show/projects-using-tatoeba

# Projects using Tatoeba

The following page lists projects that use Tatoeba. A "project" can be anything: website, mobile app, research paper, textbook, video...

[Link to Projects Using Tatoeba.org's Data and/or the Tanaka Corpus](http://a4esl.org/temporary/tatoeba/links.html)

-----
See also:

**10,000 sentences**

* About: 10,000 sentences: an Android app to help you learn new words in foreign languages
* URL: [https://github.com/tkrajina/10000sentences](https://github.com/tkrajina/10000sentences)

**thegui’s translations visualization**

* URL: [http://tguinard.github.io/](http://tguinard.github.io/)
* More info: [https://tatoeba.org/eng/wall/show_message/21926#message_21926](https://tatoeba.org/eng/wall/show_message/21926#message_21926)

-----

**CK's tab-delimited bilingual sentence pairs**

[How to Make an Anki Deck with Tatoeba Sentences Using Tab-delimited Bilingual Sentence Pairs that include only Proofread English linked to Native Speaker Sentences](http://www.manythings.org/anki/)

jiru commented 1 year ago

Looks like a bug in Tatowiki. Restarting it just solved the issue, so I am writing down what I noticed in case that bug happens again.

Errors in tatowiki log (journalctl -u tatowiki), such as:

May 01 21:26:41 sloth tatowiki[29321]: 2023-05-01 21:26:41; cppcms, error: database is locked (SqliteModel.cpp:90)
May 08 00:07:11 sloth tatowiki[29321]: 2023-05-08 00:07:11; tatowiki, error: database is locked (Articles.cpp:109)

Read operations on the database work, but writes do not, for instance using the CLI:

sqlite> insert into uploads values (200,"foobar", 1675010574);
Error: database is locked
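For reference, the same symptom (reads succeed, writes fail with "database is locked") can be reproduced in isolation. The sketch below is only an illustration of the error, not a diagnosis of what Tatowiki was doing: it uses a throwaway database, and the column names of the `uploads` table are hypothetical. One connection takes the write lock and keeps it, and a second connection then tries to write.

```python
import os
import sqlite3
import tempfile

# Throwaway database file; the schema below is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the connection in autocommit mode,
# so transactions are opened and closed explicitly.
holder = sqlite3.connect(path, isolation_level=None)
holder.execute("CREATE TABLE uploads (id INTEGER, name TEXT, created INTEGER)")
holder.execute("BEGIN IMMEDIATE")  # take the write lock and never release it
holder.execute("INSERT INTO uploads VALUES (1, 'held', 0)")

# timeout=0 makes the second connection fail at once instead of waiting.
other = sqlite3.connect(path, timeout=0, isolation_level=None)

# Reading still works (the uncommitted row is not visible).
rows = other.execute("SELECT COUNT(*) FROM uploads").fetchone()

try:
    other.execute("INSERT INTO uploads VALUES (200, 'foobar', 1675010574)")
    write_error = None
except sqlite3.OperationalError as exc:
    write_error = str(exc)

print(rows, write_error)

holder.execute("ROLLBACK")
holder.close()
other.close()
```

In this sketch the lock is released as soon as the holder rolls back. In the situation described here the lock was never released, which is consistent with a stuck writer inside the tatowiki process, though that remains a guess.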

lsof on the database file reveals a lot of open file descriptors (88 in total), all opened by tatowiki.

Only one tatowiki process is running.

systemctl restart tatowiki hangs, leaving tatowiki in an unresponsive state. I needed to kill -9 tatowiki and then start it again.

Note: some people reported that using the database in WAL mode solved the problem in their case.
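As a minimal sketch of what switching to WAL mode involves, assuming direct access to the SQLite file: the journal mode is set with a single pragma and is stored in the database file, so it stays in effect for later connections. The path and the `articles` table below are hypothetical stand-ins, not Tatowiki's actual database.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-in for the wiki database file.
path = os.path.join(tempfile.mkdtemp(), "wiki-demo.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.commit()

# The pragma returns the journal mode that is now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.close()

# WAL is persistent: a fresh connection reports the same mode.
reopened = sqlite3.connect(path)
persisted = reopened.execute("PRAGMA journal_mode").fetchone()[0]
reopened.close()

print(mode, persisted)
```

In WAL mode readers do not block the writer and the writer does not block readers, but there is still only one writer at a time, so this reduces lock contention without ruling it out.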

jiru commented 9 months ago

This happened again. Same symptoms, but on top of that, tatowiki processes were using 100% of the CPU. I enabled WAL mode and restarted tatowiki on tatoeba.org; let's see if the problem happens again.

jiru commented 5 months ago

So far the problem has not happened again.