Open cangareijo opened 2 years ago
It seems this could be resolved with a gsub("[\r\n\t]", " ", $field) in the export scripts.
For the following input
this is a test
of spaces
x this is a tab
We get
this is a test \nof spaces \nx\tthis is a tab
The original text can be reproduced from this output. I am hesitant to purge all special characters for that reason. Anyone who wants to get rid of the escaped \n, \t, etc. can do so themselves? Will PR something.
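To illustrate why escaping keeps the original text recoverable, here is a minimal Python sketch of the round trip. The function names `escape_field` and `unescape_field` are hypothetical and not part of the export scripts. Unlike the output shown above, the sketch also escapes literal backslashes; that is an added assumption so that a field which already contains the two characters `\n` does not get confused with an escaped newline.

```python
# Hypothetical helpers, not taken from the Tatoeba export scripts.

def escape_field(text: str) -> str:
    """Escape backslash, tab, newline and carriage return so a field fits on one TSV line."""
    return (
        text.replace("\\", "\\\\")  # must come first, so later escapes are not doubled
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def unescape_field(text: str) -> str:
    """Reverse escape_field; unknown escape sequences are kept as they are."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)
```

Anyone who prefers plain spaces can still post-process the escaped field, but the reverse is not possible once the characters have been replaced with spaces.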
@CangarejoAzul where is queries.csv downloaded from?
Turns out some of the files are exported straight from SQL, so the output can just be escaped like this:
-- User skill level per language
SELECT ul.language_code, ul.level, u.username, REPLACE(REPLACE(REPLACE(ul.details,'\t','\\t'),'\n','\\n'),'\r','\\r') as details FROM users_languages ul INNER JOIN users u ON ul.of_user_id = u.id ORDER BY ul.language_code ASC, ul.level DESC, u.username ASC;
Output is like
this is a test\\r\\nof spaces\\r\\nx\\tthis is a tab
Out of curiosity, where can I download a copy of Tatoeba's database?
If by "copy" you mean to download the SQL database dump including the data, that is not something available.
If you would just like to have the database schema, I believe the only way at the moment would be to install Tatoeba locally, SSH into the TatoVM and run mysqldump --no-data tatoeba > database.sql.
Checking user_languages.csv, I didn't find any fields containing a tab, but in cases where there is a newline, it is preceded by a backslash, escaping it. I'd be surprised if tabs weren't escaped with a backslash as well. So these files can be parsed correctly even when some fields contain tabs or newlines; you only need to handle backslash escapes. I think lbdx's tatoebatools does this.
Although the file format isn't the most convenient, I'd be hesitant to change the way special characters are escaped, since this would break the parsing logic of everyone who already figured out how to use the files.
I think the better solution would be to improve the documentation on the download page to explicitly mention these edge cases. (And possibly point users to the tatoebatools Python package so they know they don't have to do the parsing themselves.)
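For anyone who does want to parse the files by hand, the backslash handling described above could look roughly like the sketch below. It assumes that a backslash makes the following character literal (so a backslash followed by a real newline or tab stays inside the field), which matches what was observed for newlines in user_languages.csv but has not been verified for tabs. The function name `parse_escaped_tsv` is hypothetical, and any sequence with a special meaning in the export (if there is one) is not handled. This is not how tatoebatools is implemented.

```python
from typing import Iterator, List


def parse_escaped_tsv(text: str) -> Iterator[List[str]]:
    """Yield rows of a tab-separated text in which a backslash escapes the next character."""
    row: List[str] = []
    field: List[str] = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "\\" and i + 1 < n:
            # Escaped character: keep it as part of the field, even if it is a tab or newline.
            field.append(text[i + 1])
            i += 2
            continue
        if ch == "\t":
            # Unescaped tab ends the current field.
            row.append("".join(field))
            field = []
        elif ch == "\n":
            # Unescaped newline ends the current row.
            row.append("".join(field))
            field = []
            yield row
            row = []
        else:
            field.append(ch)
        i += 1
    if field or row:
        # Last row without a trailing newline.
        row.append("".join(field))
        yield row
```

When reading from disk, opening the file with `open(path, encoding="utf-8", newline="")` keeps Python from translating line endings before the parser sees them.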
Hi. Some of the dump files on the Downloads page are incorrectly formatted.
The details field on the user_languages.csv file, for example, allows tabs and newlines, which should not be allowed in a TSV file. They should be replaced with spaces. Also, the file contains some lines with empty fields, which should also be filled with spaces.
The query field in the queries.csv file allows commas and newlines, which should not be allowed in a CSV file. The file should be converted to TSV. Also, queries.csv is either not encoded with UTF-8, although it should be, or is corrupted, because I get a decoding error when reading it in Python using the line of code below.
for line in open("queries.csv", encoding = "utf-8"): pass
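To narrow down whether the file is in a different encoding or only damaged in a few places, it may help to find which lines fail to decode. The sketch below is one way to do that; the function name `find_invalid_utf8_lines` is hypothetical. It works on raw bytes and splits on the newline byte, which is safe for UTF-8 because that byte never occurs inside a multi-byte sequence.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(data: bytes) -> List[Tuple[int, int]]:
    """Return (line number, byte offset within the line) for each line that is not valid UTF-8."""
    bad = []
    for lineno, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte in this line.
            bad.append((lineno, exc.start))
    return bad
```

Usage would be along the lines of `find_invalid_utf8_lines(open("queries.csv", "rb").read())`. A handful of bad lines would point to corruption, while failures on most lines containing non-ASCII text would suggest the file is in another encoding.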
queries.csv should be updated once a year, excluding queries made in the previous year, to prevent manipulation of Tatominer.
TSV files should use the extension .tsv instead of .csv.
sentences.csv, sentences_detailed.csv, and sentences_base.csv could be consolidated into a single file.
user_languages.csv could be renamed languages.tsv and users_sentences.csv could be renamed reviews.tsv.
users_sentences.csv should be compressed, like the rest of the files.