Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Wayuu bulk import of sentences #2786

Open RyckRichards opened 3 years ago

RyckRichards commented 3 years ago

Rafael is requesting to have his Wayuu (guc) sentences uploaded to Tatoeba (about 900 sentences).

He has sent me the file and I've uploaded it here. It's in .csv format.

9000_10000_guc_spa.zip

jiru commented 3 years ago

Great! Just linking this to #1762 so that we can keep track of it.

jiru commented 3 years ago

The file contains Wayuu-Spanish pairs, so we will have to add these sentences as translations of existing Spanish sentences on Tatoeba. However, the file does not contain the Spanish sentence ids, only the text. This will become a problem if some of the Spanish sentences are modified, because their text will no longer match the file.

For future use of this file, I tried to match the text of Spanish sentences with existing sentence ids while this issue is still recent. I found exact text matches for all but one sentence: #3252816. Since it’s a minor change, I just modified the Spanish text (cola→Cola) and generated a new file that contains Spanish sentence ids: 9000_10000_guc_spa.with_tabs_and_spa_ids.csv.zip
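For anyone repeating this kind of matching, here is a minimal sketch of the idea in Python. It is not the script that was actually used. It assumes a tab-separated sentence export with `id`, `lang`, `text` columns and a tab-separated pairs file with the Wayuu text first and the Spanish text second; the function names and the sample rows are hypothetical placeholders.

```python
import csv
import io


def build_text_index(sentences_tsv, lang="spa"):
    """Map exact sentence text -> list of sentence ids for one language.

    Assumes rows of the form: id <TAB> lang <TAB> text.
    """
    index = {}
    reader = csv.reader(sentences_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed or empty rows
        sent_id, row_lang, text = row[0], row[1], row[2]
        if row_lang == lang:
            index.setdefault(text, []).append(int(sent_id))
    return index


def attach_ids(pairs_tsv, index):
    """Split pairs into (guc, spa, spa_id) matches and (guc, spa) misses.

    Assumes rows of the form: wayuu_text <TAB> spanish_text.
    Only exact text matches count; the first id is used for duplicates.
    """
    matched, unmatched = [], []
    reader = csv.reader(pairs_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue
        guc, spa = row[0], row[1]
        ids = index.get(spa)
        if ids:
            matched.append((guc, spa, ids[0]))
        else:
            unmatched.append((guc, spa))
    return matched, unmatched


if __name__ == "__main__":
    # Placeholder data, not real Tatoeba rows.
    sentences = io.StringIO("1\tspa\tHola.\n2\teng\tHello.\n")
    pairs = io.StringIO("guc-A\tHola.\nguc-B\thola.\n")
    matched, unmatched = attach_ids(pairs, build_text_index(sentences))
    print("matched:", matched)
    print("needs manual review:", unmatched)
```

Because the comparison is exact, a difference in capitalization alone (as with cola→Cola above) lands a pair in the unmatched list, which is the cue to review it by hand.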

ckjpn commented 2 years ago

I've noticed that all of these have since been released.

https://tatoeba.org/en/sentences/of_user/Wayuu

https://tatoeba.org/en/sentences/show_all_in/guc/none/none/indifferent

Is there some kind of problem with these sentences?

trang commented 2 years ago

@ckjpn The profile description says:

The Wayuu sentences by this account were released so when (if ever) the mass-importing feature is enabled, importing them again will automatically transfer them to the user who sent them. It's better having sentences in a very poorly represented language like Wayuu as unowned for the time being than not having them at all.