Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Linefeeds in sentence text are breaking the weekly export script #2250

Closed ckjpn closed 3 years ago

ckjpn commented 4 years ago

There may be other problems, but here is the one in sentences_detailed.csv

8649740 eng 817/5000\ Achieve easy success. pelwan 2020-04-01 04:28:01 2020-04-01 04:28:01

The scripts I have for processing the weekly files wouldn't work because of this.

ckjpn commented 4 years ago

The problem is that this item had a linefeed inserted into it.

https://tatoeba.org/eng/sentences/show/8649740

Maybe linefeeds should be filtered out before adding sentences to the database.

It looks like it may have been a copy-and-paste from a flashcard website like Quizlet.

jiru commented 4 years ago

@ckjpn I fixed the exports files if you want to download them again.

Not only are the export files not well formatted for most CSV parsing tools (except MySQL’s), but the export also halted halfway through with the following error:

awk: cmd. line:1: (FILENAME=sentences_detailed.csv FNR=8243386) fatal: can't redirect to `/var/tmp/per_language/pelwan/pelwan_sentences_detailed.tsv' (No such file or directory)

As a result, the per_language files did not get generated.

So I manually edited this week’s sentences.csv and sentences_detailed.csv to remove the garbage that sentence 8649740 is prefixed with, and I manually executed the per_language files generation. The exports are good for this week, but the export script needs to be fixed, and we want to prevent linefeeds in sentence text.

The faulty files are still available for debugging purposes: sentences.csv.old and sentences_detailed.csv.old. The relevant part is at lines 8243385 and 8243386.

Note that interestingly, while sentence 8649740 broke the generation of per_languages for sentences_detailed.csv, it didn’t break sentences.csv (but I fixed it too).

AndiPersti commented 4 years ago

The problem is that the text cleaning I've introduced in 1b31f9c21cfc45cefdb1c93bf29a0beceec78239 has a major flaw: https://github.com/Tatoeba/tatoeba2/blob/436ad1252a45e7d33e4f2405962bea6fae811d29/src/Model/Entity/Sentence.php#L70-L72

This only replaces a series of at least 2 whitespace or control characters with a single space. But I don't think we can simply replace {2,} with + in order to also replace single occurrences of these characters, because then there will be problems with e.g. French sentences, which use various special spaces (see #770).
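The flaw can be sketched in Python (the real cleaning code is PHP; the exact character classes used there are an assumption here):

```python
import re

# Sketch of the cleaning rule described above: only runs of TWO OR MORE
# whitespace characters collapse to a single space, so a single embedded
# linefeed survives untouched.
def clean(text):
    return re.sub(r'\s{2,}', ' ', text)

assert clean("Achieve  \n easy") == "Achieve easy"         # run collapsed
assert clean("817/5000\nAchieve") == "817/5000\nAchieve"   # lone \n kept

# Naively widening {2,} to + would also rewrite single typographic spaces
# such as NO-BREAK SPACE (U+00A0) used in French punctuation:
def clean_greedy(text):
    return re.sub(r'\s+', ' ', text)

assert clean_greedy("«\u00a0Oui\u00a0!\u00a0»") == "« Oui ! »"
```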

I've checked all exported sentences for whitespace and control characters (except the normal space):

| Decimal | Unicode | Character name | Category | Number of occurrences | Sentences |
|---|---|---|---|---|---|
| 160 | U+00A0 | NO-BREAK SPACE | Zs | 16493 | sentences_with_160.tsv |
| 8239 | U+202F | NARROW NO-BREAK SPACE | Zs | 14808 | sentences_with_8239 |
| 12288 | U+3000 | IDEOGRAPHIC SPACE | Zs | 201 | sentences_with_12288 |
| 8201 | U+2009 | THIN SPACE | Zs | 219 | sentences_with_8201 |
| 8202 | U+200A | HAIR SPACE | Zs | 112 | sentences_with_8202 |
| 1 | U+0001 | NONE | Cc | 1 | sentences_with_1.tsv |
| 127 | U+007F | NONE | Cc | 1 | sentences_with_127 |
| 151 | U+0097 | NONE | Cc | 2 | sentences_with_151 |
| 144 | U+0090 | NONE | Cc | 2 | sentences_with_144 |
| 157 | U+009D | NONE | Cc | 1 | sentences_with_157 |
| 129 | U+0081 | NONE | Cc | 1 | sentences_with_129 |
| 8 | U+0008 | NONE | Cc | 1 | sentences_with_8 |
| 8232 | U+2028 | LINE SEPARATOR | Zl | 2 | sentences_with_8232 |

(The number of occurrences is sometimes higher than the number of sentences because some sentences contain more than one character)

I think we can safely replace the control characters (which include LINE FEED). For the others I'm afraid only manual inspection is safe for deciding whether the character should be replaced or not.
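As a sketch of that idea (an assumption, not the shipped fix), replacing only Unicode category Cc characters with a space would catch LINE FEED while leaving the Zs spaces listed above for manual review:

```python
import unicodedata

# Replace only characters in the Unicode Cc category (which includes
# LINE FEED, U+000A) with a plain space; leave Zs spaces such as
# NO-BREAK SPACE untouched for manual inspection.
def strip_controls(text):
    return ''.join(' ' if unicodedata.category(ch) == 'Cc' else ch
                   for ch in text)

assert strip_controls("foo\nbar") == "foo bar"           # U+000A, Cc
assert strip_controls("foo\x08bar") == "foo bar"         # U+0008, Cc
assert strip_controls("foo\u00a0bar") == "foo\u00a0bar"  # U+00A0, Zs kept
```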

AndiPersti commented 4 years ago

Note that interestingly, while sentence 8649740 broke the generation of per_languages for sentences_detailed.csv, it didn’t break sentences.csv (but I fixed it too).

That's because there are only three fields in sentences.csv. So when the sentence text was broken into two lines, the second line contained only a single field and the variable for the second field was assigned the empty string.

But in sentences_detailed.csv there are other fields after the text, so in the second line the username was assigned to the variable for the second field, and that resulted in a non-existent directory path which broke the script.
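The difference can be simulated in a few lines of Python (the awk script's exact shape is assumed here; only the field-splitting behaviour matters — the per-language split redirects each row to a path built from field 2):

```python
# The sentence text contains a trailing backslash followed by a raw
# linefeed, so each record is split across two physical lines.
detailed = ("8649740\teng\t817/5000\\\n"
            "Achieve easy success.\tpelwan\t2020-04-01 04:28:01\t"
            "2020-04-01 04:28:01")
plain = "8649740\teng\t817/5000\\\nAchieve easy success."

def field2(line):
    parts = line.split("\t")
    return parts[1] if len(parts) > 1 else ""  # awk: an absent $2 is ""

# In sentences_detailed.csv the username lands in field 2 of the second
# physical line, so awk tries to redirect to .../per_language/pelwan/...
# which does not exist.
assert field2(detailed.split("\n")[1]) == "pelwan"

# In sentences.csv the second physical line has a single field, so
# field 2 is the empty string and the redirect still succeeds.
assert field2(plain.split("\n")[1]) == ""
```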

but the export script needs to be fixed

Do you have anything specific in mind? Better error reporting? Better error recovering?

jiru commented 4 years ago

Do you have anything specific in mind? Better error reporting? Better error recovering?

We can make it so that crontab output is sent to sysadmins@tatoeba.org.

We can change the way mysql writes linefeed characters, or change how the awk command parses the file.

ckjpn commented 4 years ago

The files had the same problem this week. sentences_detailed.tar.bz2

jiru commented 4 years ago

@ckjpn Sorry about that! I manually fixed the exported files again. Could you fix sentence 8649740 before next Saturday?

jiru commented 4 years ago

I found a way to avoid linefeeds in the CSV files: mysql --batch. It replaces linefeeds with \n instead of \ + an actual linefeed. Compare CSV export:

Using mysql’s INTO OUTFILE (what we do now):

vagrant@stretch:~/Tatoeba$ echo "select concat(" $(for i in $(seq 0 127); do echo -n "'$i:',CHAR($i),"; done) "'') as test INTO OUTFILE '/var/tmp/test.csv';" | sudo mysql tatoeba
vagrant@stretch:~/Tatoeba$ cat /var/tmp/test.csv 
0:\01:2:3:4:5:6:7:89:\  10:\
11:
   12:
14:15:16:17:18:19:20:21:22:23:24:25:26:�27:8:29:30:31:32: 33:!34:"35:#36:$37:%38:&39:'40:(41:)42:*43:+44:,45:-46:.47:/48:049:150:251:352:453:554:655:756:857:958::59:;60:<61:=62:>63:?64:@65:A66:B67:C68:D69:E70:F71:G72:H73:I74:J75:K76:L77:M78:N79:O80:P81:Q82:R83:S84:T85:U86:V87:W88:X89:Y90:Z91:[92:\\93:]94:^95:_96:`97:a98:b99:c100:d101:e102:f103:g104:h105:i106:j107:k108:l109:m110:n111:o112:p113:q114:r115:s116:t117:u118:v119:w120:x121:y122:z123:{124:|125:}126:~127:

Using mysql --batch:

vagrant@stretch:~/Tatoeba$ echo "select concat(" $(for i in $(seq 0 127); do echo -n "'$i:',CHAR($i),"; done) "'') as test" | mysql tatoeba --batch
test
0:\01:2:3:4:5:6:7:89:\t10:\n11:
                               12:
14:15:16:17:18:19:20:21:22:23:24:25:26:�27:8:29:30:31:32: 33:!34:"35:#36:$37:%38:&39:'40:(41:)42:*43:+44:,45:-46:.47:/48:049:150:251:352:453:554:655:756:857:958::59:;60:<61:=62:>63:?64:@65:A66:B67:C68:D69:E70:F71:G72:H73:I74:J75:K76:L77:M78:N79:O80:P81:Q82:R83:S84:T85:U86:V87:W88:X89:Y90:Z91:[92:\\93:]94:^95:_96:`97:a98:b99:c100:d101:e102:f103:g104:h105:i106:j107:k108:l109:m110:n111:o112:p113:q114:r115:s116:t117:u118:v119:w120:x121:y122:z123:{124:|125:}126:~127:

In addition, mysql --batch converts tabs into \t, and it doesn’t require root! :rocket:
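For consumers of the exports, reading such a line back is straightforward. Here is a minimal sketch (not the project's actual reader) assuming the escape set shown above: \t, \n, \\ and \0, with a raw tab as field separator:

```python
import re

# Two-character escapes emitted by mysql --batch inside a field.
_ESCAPES = {"\\n": "\n", "\\t": "\t", "\\\\": "\\", "\\0": "\0"}

def parse_batch_line(line):
    def unescape(field):
        # A single left-to-right pass, so "\\n" decodes to a backslash
        # followed by "n", not to a linefeed.
        return re.sub(r"\\[nt0\\]", lambda m: _ESCAPES[m.group(0)], field)
    return [unescape(f) for f in line.rstrip("\n").split("\t")]

# Sentence 8649740 now round-trips as a single physical line:
row = "8649740\teng\t817/5000\\\\\\nAchieve easy success.\tpelwan"
assert parse_batch_line(row) == \
    ["8649740", "eng", "817/5000\\\nAchieve easy success.", "pelwan"]
```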

ckjpn commented 4 years ago

I may be misunderstanding this, but are you suggesting that we allow \n in a sentence's text field?

If so, I have a feeling that this would be undesirable.

AndiPersti commented 4 years ago

@jiru Nice!

I may be misunderstanding this, but are you suggesting that we allow \n in a sentence's text field?

No, we don't allow a linefeed (U+000A) in a sentence text. But there was a bug where we missed one and it broke the export script. This should be fixed.

But there are still other places where we allow them (wall posts, sentence comments, user language description, ...) and for these fields, embedded linefeeds should be properly escaped in the exports.

ckjpn commented 4 years ago

I always have to change the exported comments file to get each record on one line.

I do this by replacing the linefeeds with the HTML <br />.

This way, I can easily grab every line with a comment by myself.

Is this the kind of thing you are talking about?

AndiPersti commented 4 years ago

Yes, these exports should be easier to handle in the future.

LBeaudoux commented 4 years ago

@jiru I'm glad you found a fix to this multi-line entry issue. But at least one pain point remains when it comes to reading the weekly dump files. You get an extra element when you split your row in the case there is a tab in a field. It's rare but it happens, e.g. the name of the well-known 907 list created by CK starts with a tab.

907 CK 2012-02-04 07:48:39 2020-05-20 03:57:18 \ ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences creator

becomes

['907', 'CK', '2012-02-04 07:48:39', '2020-05-20 03:57:18', '\\', ' ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences', 'creator']

In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas. A solution may be to enclose fields by double quotation marks as seen at https://www.mysqltutorial.org/mysql-export-table-to-csv/
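The quoting approach can be illustrated with Python's csv module (a sketch of the idea, not a proposal for the actual export code): quoted fields make embedded tabs and delimiters unambiguous for standard CSV parsers.

```python
import csv
import io

# List 907's name starts with a tab; with quoting enabled it survives a
# write/read round trip as a single field.
row = ["907", "CK", "2012-02-04 07:48:39", "2020-05-20 03:57:18",
       "\t ! #01 - Proofread Good English Sentences That CK Uses on His "
       "Projects - List 907 - Over 750,000 Sentences", "creator"]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)

# Reading it back recovers exactly six fields, embedded tab and all.
fields = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
assert fields == row
assert len(fields) == 6
```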

AndiPersti commented 4 years ago

@jiru I'm glad you found a fix to this multi-line entry issue. But at least one pain point remains when it comes to reading the weekly dump files. You get an extra element when you split your row in the case there is a tab in a field. It's rare but it happens, e.g. the name of the well-known 907 list created by CK starts with a tab.

IMHO a list name shouldn't have a tab and thus this is a bug and we should properly clean the list name before storing it in the database. I've just created #2371.

In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.

The data in that file isn't from MariaDB. It is an edited version from Manticore's log file (our search engine) and the original doesn't contain any delimiters (see Plain log format). So for the next time (not sure whether @jiru plans to update that file in the future) I guess we should use another delimiter.

jiru commented 4 years ago

But at least one pain point remains when it comes to reading the weekly dump files. You get an extra element when you split your row in the case there is a tab in a field.

This problem will also be solved by mysql --batch as tabs are converted to \t:

$ mysql --batch tatoeba -e "select id, name from sentences_lists where id = 907"
id  name
907 \t ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences
jiru commented 4 years ago

In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.

The data in that file isn't from MariaDB. It is an edited version from Manticore's log file (our search engine) and the original doesn't contain any delimiters (see Plain log format).

Yeah I confirm that was a quick and dirty extract from the log files. Nothing like a proper export. :wink:

jiru commented 4 years ago

And mysql --batch outputs the string NULL for null values, as opposed to \N in our current exports. Along with the new \n and \t, that will certainly cause some trouble to people making use of our exports. Any idea on how to make the transition smoother?

LBeaudoux commented 4 years ago

The IFNULL() function may help you replace the NULL values with \N. I found this blog post that deals with this very topic.

jiru commented 4 years ago

Good point! :blush: