Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Linefeeds in sentence text are breaking the weekly export script #2250

Closed ckjpn closed 3 years ago

ckjpn commented 4 years ago

There may be other problems, but here is the one in sentences_detailed.csv

8649740 eng 817/5000\ Achieve easy success. pelwan 2020-04-01 04:28:01 2020-04-01 04:28:01

The scripts I have for processing the weekly files wouldn't work because of this.

ckjpn commented 4 years ago

The problem is that this item had a linefeed inserted into it.

https://tatoeba.org/eng/sentences/show/8649740

Maybe linefeeds should be filtered out before adding sentences to the database.

It looks like it may have been a copy-and-paste from a flashcard website like Quizlet.

jiru commented 4 years ago

@ckjpn I fixed the exports files if you want to download them again.

Not only are the export files not well formatted for most CSV parsing tools (except MySQL’s), but the export also halted halfway through with the following error:

awk: cmd. line:1: (FILENAME=sentences_detailed.csv FNR=8243386) fatal: can't redirect to `/var/tmp/per_language/pelwan/pelwan_sentences_detailed.tsv' (No such file or directory)

As a result, the per_language files did not get generated.

So I manually edited this week’s sentences.csv and sentences_detailed.csv to remove the garbage that sentence 8649740 is prefixed with, and I manually executed the per_language files generation. The exports are good for this week, but the export script needs to be fixed, and we want to prevent linefeeds in sentence text.

The faulty files are still available for debugging purposes: sentences.csv.old and sentences_detailed.csv.old. The relevant part is at lines 8243385 and 8243386.

Note that interestingly, while sentence 8649740 broke the generation of per_languages for sentences_detailed.csv, it didn’t break sentences.csv (but I fixed it too).

AndiPersti commented 4 years ago

The problem is that the text cleaning I've introduced in 1b31f9c21cfc45cefdb1c93bf29a0beceec78239 has a major flaw: https://github.com/Tatoeba/tatoeba2/blob/436ad1252a45e7d33e4f2405962bea6fae811d29/src/Model/Entity/Sentence.php#L70-L72

This only replaces a series of at least 2 whitespace or control characters with a single space. But I don't think we can simply replace {2,} with + in order to also replace single occurrences of these characters, because then there will be problems with e.g. French sentences, which use various special spaces (see #770).
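The flaw can be sketched in Python (the real cleaning code is PHP; the exact character classes used there are an assumption here):

```python
import re

# Sketch of the cleaning rule described above: only runs of TWO OR MORE
# whitespace characters collapse to a single space, so a single embedded
# linefeed survives untouched.
def clean(text):
    return re.sub(r'\s{2,}', ' ', text)

assert clean("Achieve  \n easy") == "Achieve easy"         # run collapsed
assert clean("817/5000\nAchieve") == "817/5000\nAchieve"   # lone \n kept

# Naively widening {2,} to + would also rewrite single typographic spaces
# such as NO-BREAK SPACE (U+00A0) used in French punctuation:
def clean_greedy(text):
    return re.sub(r'\s+', ' ', text)

assert clean_greedy("«\u00a0Oui\u00a0!\u00a0»") == "« Oui ! »"
```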

I've checked all exported sentences for whitespace and control characters (except the normal space):

| Decimal | Unicode | Character name | Category | Number of occurrences | Sentences |
|---|---|---|---|---|---|
| 160 | U+00A0 | NO-BREAK SPACE | Zs | 16493 | sentences_with_160.tsv |
| 8239 | U+202F | NARROW NO-BREAK SPACE | Zs | 14808 | sentences_with_8239 |
| 12288 | U+3000 | IDEOGRAPHIC SPACE | Zs | 201 | sentences_with_12288 |
| 8201 | U+2009 | THIN SPACE | Zs | 219 | sentences_with_8201 |
| 8202 | U+200A | HAIR SPACE | Zs | 112 | sentences_with_8202 |
| 1 | U+0001 | NONE | Cc | 1 | sentences_with_1.tsv |
| 127 | U+007F | NONE | Cc | 1 | sentences_with_127 |
| 151 | U+0097 | NONE | Cc | 2 | sentences_with_151 |
| 144 | U+0090 | NONE | Cc | 2 | sentences_with_144 |
| 157 | U+009D | NONE | Cc | 1 | sentences_with_157 |
| 129 | U+0081 | NONE | Cc | 1 | sentences_with_129 |
| 8 | U+0008 | NONE | Cc | 1 | sentences_with_8 |
| 8232 | U+2028 | LINE SEPARATOR | Zl | 2 | sentences_with_8232 |

(The number of occurrences is sometimes higher than the number of sentences because some sentences contain more than one character)

I think we can safely replace the control characters (which include LINE FEED). For the others I'm afraid only manual inspection is safe for deciding whether the character should be replaced or not.
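As a sketch of that idea (an assumption, not the shipped fix), replacing only Unicode category Cc characters with a space would catch LINE FEED while leaving the Zs spaces listed above for manual review:

```python
import unicodedata

# Replace only characters in the Unicode Cc category (which includes
# LINE FEED, U+000A) with a plain space; leave Zs spaces such as
# NO-BREAK SPACE untouched for manual inspection.
def strip_controls(text):
    return ''.join(' ' if unicodedata.category(ch) == 'Cc' else ch
                   for ch in text)

assert strip_controls("foo\nbar") == "foo bar"           # U+000A, Cc
assert strip_controls("foo\x08bar") == "foo bar"         # U+0008, Cc
assert strip_controls("foo\u00a0bar") == "foo\u00a0bar"  # U+00A0, Zs kept
```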

AndiPersti commented 4 years ago

Note that interestingly, while sentence 8649740 broke the generation of per_languages for sentences_detailed.csv, it didn’t break sentences.csv (but I fixed it too).

That's because there are only three fields in sentences.csv. So when the sentence text was broken into two lines, the second line contained only a single field and the variable for the second field was assigned the empty string.

But in sentences_detailed.csv there are other fields after the text, so in the second line the username was assigned to the variable for the second field, and that resulted in a non-existent directory path which broke the script.
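The difference can be simulated in a few lines of Python (the awk script's exact shape is assumed here; only the field-splitting behaviour matters — the per-language split redirects each row to a path built from field 2):

```python
# The sentence text contains a trailing backslash followed by a raw
# linefeed, so each record is split across two physical lines.
detailed = ("8649740\teng\t817/5000\\\n"
            "Achieve easy success.\tpelwan\t2020-04-01 04:28:01\t"
            "2020-04-01 04:28:01")
plain = "8649740\teng\t817/5000\\\nAchieve easy success."

def field2(line):
    parts = line.split("\t")
    return parts[1] if len(parts) > 1 else ""  # awk: an absent $2 is ""

# In sentences_detailed.csv the username lands in field 2 of the second
# physical line, so awk tries to redirect to .../per_language/pelwan/...
# which does not exist.
assert field2(detailed.split("\n")[1]) == "pelwan"

# In sentences.csv the second physical line has a single field, so
# field 2 is the empty string and the redirect still succeeds.
assert field2(plain.split("\n")[1]) == ""
```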

but the export script needs to be fixed

Do you have anything specific in mind? Better error reporting? Better error recovering?

jiru commented 4 years ago

Do you have anything specific in mind? Better error reporting? Better error recovering?

We can make it so that crontab output is sent to sysadmins@tatoeba.org.

We can change the way mysql writes linefeed characters, or change how the awk command parses the file.

ckjpn commented 4 years ago

The files had the same problem this week. sentences_detailed.tar.bz2

jiru commented 4 years ago

@ckjpn Sorry about that! I manually fixed the exported files again. Could you fix sentence 8649740 before next Saturday?

jiru commented 4 years ago

I found a way to avoid linefeeds in the CSV files: mysql --batch. It replaces linefeeds with \n instead of \ + an actual linefeed. Compare CSV export:

Using mysql’s INTO OUTFILE (what we do now):

vagrant@stretch:~/Tatoeba$ echo "select concat(" $(for i in $(seq 0 127); do echo -n "'$i:',CHAR($i),"; done) "'') as test INTO OUTFILE '/var/tmp/test.csv';" | sudo mysql tatoeba
vagrant@stretch:~/Tatoeba$ cat /var/tmp/test.csv 
0:\01:2:3:4:5:6:7:89:\  10:\
11:
   12:
14:15:16:17:18:19:20:21:22:23:24:25:26:�27:8:29:30:31:32: 33:!34:"35:#36:$37:%38:&39:'40:(41:)42:*43:+44:,45:-46:.47:/48:049:150:251:352:453:554:655:756:857:958::59:;60:<61:=62:>63:?64:@65:A66:B67:C68:D69:E70:F71:G72:H73:I74:J75:K76:L77:M78:N79:O80:P81:Q82:R83:S84:T85:U86:V87:W88:X89:Y90:Z91:[92:\\93:]94:^95:_96:`97:a98:b99:c100:d101:e102:f103:g104:h105:i106:j107:k108:l109:m110:n111:o112:p113:q114:r115:s116:t117:u118:v119:w120:x121:y122:z123:{124:|125:}126:~127:

Using mysql --batch:

vagrant@stretch:~/Tatoeba$ echo "select concat(" $(for i in $(seq 0 127); do echo -n "'$i:',CHAR($i),"; done) "'') as test" | mysql tatoeba --batch
test
0:\01:2:3:4:5:6:7:89:\t10:\n11:
                               12:
14:15:16:17:18:19:20:21:22:23:24:25:26:�27:8:29:30:31:32: 33:!34:"35:#36:$37:%38:&39:'40:(41:)42:*43:+44:,45:-46:.47:/48:049:150:251:352:453:554:655:756:857:958::59:;60:<61:=62:>63:?64:@65:A66:B67:C68:D69:E70:F71:G72:H73:I74:J75:K76:L77:M78:N79:O80:P81:Q82:R83:S84:T85:U86:V87:W88:X89:Y90:Z91:[92:\\93:]94:^95:_96:`97:a98:b99:c100:d101:e102:f103:g104:h105:i106:j107:k108:l109:m110:n111:o112:p113:q114:r115:s116:t117:u118:v119:w120:x121:y122:z123:{124:|125:}126:~127:

In addition, mysql --batch converts tabs into \t, and it doesn’t require root! :rocket:
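For consumers of the exports, reading such a line back is straightforward. Here is a minimal sketch (not the project's actual reader) assuming the escape set shown above: \t, \n, \\ and \0, with a raw tab as field separator:

```python
import re

# Two-character escapes emitted by mysql --batch inside a field.
_ESCAPES = {"\\n": "\n", "\\t": "\t", "\\\\": "\\", "\\0": "\0"}

def parse_batch_line(line):
    def unescape(field):
        # A single left-to-right pass, so "\\n" decodes to a backslash
        # followed by "n", not to a linefeed.
        return re.sub(r"\\[nt0\\]", lambda m: _ESCAPES[m.group(0)], field)
    return [unescape(f) for f in line.rstrip("\n").split("\t")]

# Sentence 8649740 now round-trips as a single physical line:
row = "8649740\teng\t817/5000\\\\\\nAchieve easy success.\tpelwan"
assert parse_batch_line(row) == \
    ["8649740", "eng", "817/5000\\\nAchieve easy success.", "pelwan"]
```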

ckjpn commented 4 years ago

I may be misunderstanding this, but are you suggesting that we allow \n in a sentence's text field?

If so, I have a feeling that this would be undesirable.

AndiPersti commented 4 years ago

@jiru Nice!

I may be misunderstanding this, but are you suggesting that we allow \n in a sentence's text field?

No, we don't allow a linefeed (U+000A) in a sentence text. But there was a bug where we missed one and it broke the export script. This should be fixed.

But there are still other places where we allow them (wall posts, sentence comments, user language description, ...) and for these fields, embedded linefeeds should be properly escaped in the exports.

ckjpn commented 4 years ago

I always have to change the exported comments file to get each record on one line.

I do this by replacing the linefeeds with the HTML <br />.

This way, I can easily grab every line with a comment by myself.

Is this the kind of thing you are talking about?

AndiPersti commented 4 years ago

Yes, these exports should be easier to handle in the future.

LBeaudoux commented 4 years ago

@jiru I'm glad you found a fix to this multi-line entry issue. But at least one pain point remains when it comes to reading the weekly dump files. You get an extra element when you split your row in the case there is a tab in a field. It's rare but it happens, e.g. the name of the well-known 907 list created by CK starts with a tab.

907 CK 2012-02-04 07:48:39 2020-05-20 03:57:18 \ ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences creator

becomes

['907', 'CK', '2012-02-04 07:48:39', '2020-05-20 03:57:18', '\\', ' ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences', 'creator']

In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas. A solution may be to enclose fields by double quotation marks as seen at https://www.mysqltutorial.org/mysql-export-table-to-csv/
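The quoting approach can be illustrated with Python's csv module (a sketch of the idea, not a proposal for the actual export code): quoted fields make embedded tabs and delimiters unambiguous for standard CSV parsers.

```python
import csv
import io

# List 907's name starts with a tab; with quoting enabled it survives a
# write/read round trip as a single field.
row = ["907", "CK", "2012-02-04 07:48:39", "2020-05-20 03:57:18",
       "\t ! #01 - Proofread Good English Sentences That CK Uses on His "
       "Projects - List 907 - Over 750,000 Sentences", "creator"]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)

# Reading it back recovers exactly six fields, embedded tab and all.
fields = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
assert fields == row
assert len(fields) == 6
```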

AndiPersti commented 4 years ago

@jiru I'm glad you found a fix to this multi-line entry issue. But at least one pain point remains when it comes to reading the weekly dump files. You get an extra element when you split your row in the case there is a tab in a field. It's rare but it happens, e.g. the name of the well-known 907 list created by CK starts with a tab.

IMHO a list name shouldn't have a tab and thus this is a bug and we should properly clean the list name before storing it in the database. I've just created #2371.

In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.

The data in that file isn't from MariaDB. It is an edited version from Manticore's log file (our search engine) and the original doesn't contain any delimiters (see Plain log format). So for the next time (not sure whether @jiru plans to update that file in the future) I guess we should use another delimiter.

jiru commented 4 years ago

But at least one pain point remains when it comes to reading the weekly dump files. You get an extra element when you split your row in the case there is a tab in a field.

This problem will also be solved by mysql --batch as tabs are converted to \t:

$ mysql --batch tatoeba -e "select id, name from sentences_lists where id = 907"
id  name
907 \t ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences
jiru commented 4 years ago

In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.

The data in that file isn't from MariaDB. It is an edited version from Manticore's log file (our search engine) and the original doesn't contain any delimiters (see Plain log format).

Yeah I confirm that was a quick and dirty extract from the log files. Nothing like a proper export. :wink:

jiru commented 4 years ago

And mysql --batch outputs the string NULL for null values, as opposed to \N in our current exports. Along with the new \n and \t, that will certainly cause some trouble to people making use of our exports. Any idea on how to make the transition smoother?

LBeaudoux commented 4 years ago

The IFNULL() function may help you replace the NULL values with \N. I found this blog post that deals with this very topic.

jiru commented 4 years ago

Good point! :blush: