Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Incorrectly formatted CSV files #2984

Open cangareijo opened 2 years ago

cangareijo commented 2 years ago

Hi. Some of the dump files on the Downloads page are incorrectly formatted.

The details field in the user_languages.csv file, for example, allows tabs and newlines, which should not be allowed in a TSV file. They should be replaced with spaces. Also, the file contains some lines with empty fields, which should also be filled with spaces.

The query field in the queries.csv file allows commas and newlines, which should not be allowed in a CSV file. The file should be converted to TSV. Also, queries.csv is either not encoded with UTF-8, although it should be, or is corrupted, because I get a decoding error when reading it in Python using the line of code below.

for line in open("queries.csv", encoding="utf-8"): pass
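To narrow down whether the file is mis-encoded or corrupted, one option is to read it in binary mode and report which lines fail to decode. The following is a minimal sketch, not part of the original report: the function name `find_undecodable_lines` and its `limit` parameter are made up for illustration, and it assumes the .bz2 archive has already been decompressed.

```python
def find_undecodable_lines(path, encoding="utf-8", limit=5):
    """Return up to `limit` tuples (line_number, byte_offset, raw_line)
    for lines that cannot be decoded with `encoding`."""
    bad = []
    offset = 0  # byte offset of the start of the current line
    with open(path, "rb") as f:
        # Splitting on b"\n" is safe for UTF-8: the byte 0x0A never
        # occurs inside a multi-byte sequence.
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # err.start is the index of the first bad byte in this line
                bad.append((lineno, offset + err.start, raw))
                if len(bad) >= limit:
                    break
            offset += len(raw)
    return bad
```

Printing the raw bytes of the reported lines usually makes it clear whether they are in another encoding (for example Latin-1) or simply truncated.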

queries.csv should be updated once a year, excluding queries made in the previous year, to prevent manipulation of Tatominer.

TSV files should use the extension .tsv instead of .csv.

sentences.csv, sentences_detailed.csv, and sentences_base.csv could be consolidated into a single file.

user_languages.csv could be renamed languages.tsv and users_sentences.csv could be renamed reviews.tsv.

users_sentences.csv should be compressed, like the rest of the files.

vinkaks commented 2 years ago

Seems it can be resolved with a gsub("[\r\n\t]", " ", $field) in the export scripts.

For the following input

this is a test
of spaces
x   this is a tab

We get

this is a test \nof spaces \nx\tthis is a tab

The original text can be reproduced from this output. I am hesitant to purge all special characters for that reason. Anyone who wants to get rid of the escaped \n, \t, etc. can do so themselves. I will open a PR.
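The round trip can be illustrated with a short Python sketch. It rests on one assumption not stated in the thread: for the original text to be recoverable in every case, the backslash itself must also be escaped, otherwise a literal backslash followed by "n" in the input is indistinguishable from an escaped newline. The names `escape_field` and `unescape_field` are hypothetical and are not the export scripts' actual functions.

```python
import re

# Mapping between raw characters and their two-character escapes.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {escaped: raw for raw, escaped in _ESCAPES.items()}


def escape_field(text):
    """Replace backslash, tab, newline and carriage return with escapes."""
    return re.sub(r"[\\\t\n\r]", lambda m: _ESCAPES[m.group(0)], text)


def unescape_field(text):
    """Invert escape_field, scanning left to right so that an escaped
    backslash is never mistaken for the start of another escape."""
    return re.sub(r"\\[\\tnr]", lambda m: _UNESCAPES[m.group(0)], text)


original = "this is a test\nof spaces\nx\tthis is a tab"
escaped = escape_field(original)
assert "\n" not in escaped and "\t" not in escaped
assert unescape_field(escaped) == original
```

Doing the unescape in a single left-to-right pass matters: chaining separate replace calls can mis-decode a field that contains an escaped backslash followed by the letter "n".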

@CangarejoAzul where is queries.csv downloaded from?

cangareijo commented 2 years ago

Thanks.

https://downloads.tatoeba.org/stats/queries.csv.bz2

vinkaks commented 2 years ago

Turns out some of the files are exported straight from SQL, so the output can just be escaped in the query, like this:

-- User skill level per language
SELECT ul.language_code, ul.level, u.username,
       REPLACE(REPLACE(REPLACE(ul.details, '\t', '\\t'), '\n', '\\n'), '\r', '\\r') AS details
FROM users_languages ul
INNER JOIN users u ON ul.of_user_id = u.id
ORDER BY ul.language_code ASC, ul.level DESC, u.username ASC;

Output is like

this is a test\\r\\nof spaces\\r\\nx\\tthis is a tab

cangareijo commented 2 years ago

Out of curiosity, where can I download a copy of Tatoeba's database?

trang commented 2 years ago

If by "copy" you mean to download the SQL database dump including the data, that is not something available.

If you would just like to have the database schema, I believe the only way at the moment would be to install Tatoeba locally, SSH into the TatoVM and run mysqldump --no-data tatoeba > database.sql.

Yorwba commented 2 years ago

Checking user_languages.csv, I didn't find any fields containing a tab, but wherever a field contains a newline, the newline is preceded by a backslash, which escapes it. I'd be surprised if tabs weren't escaped with a backslash as well. So these files can be parsed correctly even when some fields contain tabs or newlines; you only need to handle backslash escapes. I think lbdx's tatoebatools does this.
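The backslash handling described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tatoebatools implementation: it assumes that a backslash makes the next character literal (so backslash + newline stays inside the field, and backslash + tab does not split it), and it does not special-case any NULL marker the export may use. The function name `parse_escaped_tsv` is hypothetical.

```python
def parse_escaped_tsv(text):
    """Split `text` into records (lists of fields).

    An unescaped tab ends a field and an unescaped newline ends a record.
    A backslash makes the following character part of the field.
    """
    records, fields, buf = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            buf.append(text[i + 1])  # escaped character is kept literally
            i += 2
            continue
        if ch == "\t":
            fields.append("".join(buf))
            buf = []
        elif ch == "\n":
            fields.append("".join(buf))
            records.append(fields)
            fields, buf = [], []
        else:
            buf.append(ch)
        i += 1
    if buf or fields:  # last record without a trailing newline
        fields.append("".join(buf))
        records.append(fields)
    return records


# Made-up sample rows in the shape language, level, username, details.
sample = "eng\t5\talice\tline one\\\nline two\nfra\t3\tbob\t\n"
for record in parse_escaped_tsv(sample):
    print(record)
```

Reading the file with `open(path, encoding="utf-8", newline="")` keeps carriage returns intact so that they end up inside the field rather than being normalized away.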

Although the file format isn't the most convenient, I'd be hesitant to change the way special characters are escaped, since this would break the parsing logic of everyone who already figured out how to use the files.

I think the better solution would be to improve the documentation on the download page to explicitly mention these edge cases. (And possibly point users to the tatoebatools Python package so they know they don't have to do the parsing themselves.)