The problem is that this item had a linefeed inserted into it.
https://tatoeba.org/eng/sentences/show/8649740
Maybe linefeeds should be filtered out before adding sentences to the database.
It looks like it may have been a copy-and-paste from a flashcard website like Quizlet.
@ckjpn I fixed the export files if you want to download them again.
Not only are the export files badly formatted for most CSV parsing tools (except MySQL’s), but the export also halted halfway through with the following error:
```
awk: cmd. line:1: (FILENAME=sentences_detailed.csv FNR=8243386) fatal: can't redirect to `/var/tmp/per_language/pelwan/pelwan_sentences_detailed.tsv' (No such file or directory)
```
As a result, the per_language files did not get generated.
So I manually edited this week’s sentences.csv and sentences_detailed.csv to remove the garbage that sentence 8649740 is prefixed with, and I manually reran the per_language files generation. The exports are good for this week, but the export script needs to be fixed, and we want to prevent linefeeds in sentence text.
The faulty files are still available for debugging purposes: sentences.csv.old and sentences_detailed.csv.old. The relevant part is at lines 8243385 and 8243386.
Note that interestingly, while sentence 8649740 broke the generation of the per_language files for sentences_detailed.csv, it didn’t break sentences.csv (but I fixed it too).
The problem is that the text cleaning I've introduced in 1b31f9c21cfc45cefdb1c93bf29a0beceec78239 has a major flaw:
https://github.com/Tatoeba/tatoeba2/blob/436ad1252a45e7d33e4f2405962bea6fae811d29/src/Model/Entity/Sentence.php#L70-L72
This only replaces a series of at least 2 whitespace or control characters with a single space. But I don't think we can simply change the `{2,}` quantifier to `+` so that single occurrences of these characters are replaced as well, because then there will be problems with e.g. French sentences, which use different kinds of spaces (see #770).
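To illustrate why (my own example, not the code in Sentence.php): with a `+` quantifier, a single NARROW NO-BREAK SPACE, which French typography puts before characters like "?", would be collapsed into a plain space as well:

```sh
# Hypothetical illustration: a + quantifier also eats *single* special
# spaces, destroying the French U+202F before "?".
printf 'Combien\xe2\x80\xaf?\n' | perl -CSD -lpe 's/[\s\x{00A0}\x{202F}]+/ /g'
# prints "Combien ?" with a plain ASCII space: the narrow no-break space is gone
```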
I've checked all exported sentences for whitespace and control characters (except the normal space):

Decimal | Unicode | Character name | Category | Number of occurrences | Sentences |
---|---|---|---|---|---|
160 | U+00A0 | NO-BREAK SPACE | Zs | 16493 | sentences_with_160.tsv |
8239 | U+202F | NARROW NO-BREAK SPACE | Zs | 14808 | sentences_with_8239 |
12288 | U+3000 | IDEOGRAPHIC SPACE | Zs | 201 | sentences_with_12288 |
8201 | U+2009 | THIN SPACE | Zs | 219 | sentences_with_8201 |
8202 | U+200A | HAIR SPACE | Zs | 112 | sentences_with_8202 |
1 | U+0001 | NONE | Cc | 1 | sentences_with_1.tsv |
127 | U+007F | NONE | Cc | 1 | sentences_with_127 |
151 | U+0097 | NONE | Cc | 2 | sentences_with_151 |
144 | U+0090 | NONE | Cc | 2 | sentences_with_144 |
157 | U+009D | NONE | Cc | 1 | sentences_with_157 |
129 | U+0081 | NONE | Cc | 1 | sentences_with_129 |
8 | U+0008 | NONE | Cc | 1 | sentences_with_8 |
8232 | U+2028 | LINE SEPARATOR | Zl | 2 | sentences_with_8232 |
(The number of occurrences is sometimes higher than the number of sentences because some sentences contain more than one such character.)
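For reference, such a scan can be reproduced along these lines (a sketch; the actual check may have been done differently):

```sh
# Count lines of sentences_detailed.csv containing U+202F (NARROW NO-BREAK
# SPACE); needs GNU grep with PCRE support, run in a UTF-8 locale.
grep -cP '\x{202F}' sentences_detailed.csv
# Save the matching sentences for manual inspection:
grep -P '\x{202F}' sentences_detailed.csv > sentences_with_8239.tsv
```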
I think we can safely replace the control characters (which include LINE FEED). For the others, I'm afraid only manual inspection is safe for deciding whether the character should be replaced or not.
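A minimal sketch of that idea (my own illustration, not the actual Sentence.php code): map control and line/paragraph separator characters to a space while leaving the Zs spaces untouched for manual review:

```sh
# Replace Cc/Zl/Zp characters in a sentence text with a space, then squeeze
# runs of plain spaces; U+00A0, U+202F etc. (category Zs) pass through intact.
printf 'bad\x08text\xe2\x80\xa8here\n' |
  perl -CSD -lpe 's/[\p{Cc}\p{Zl}\p{Zp}]/ /g; s/ {2,}/ /g'
# prints "bad text here"
```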
> Note that interestingly, while sentence 8649740 broke the generation of the per_language files for sentences_detailed.csv, it didn’t break sentences.csv (but I fixed it too).
That's because there are only three fields in `sentences.csv`. So when the sentence text was broken into two lines, the second line contained only a single field, and the variable for the second field was assigned the empty string.

But in `sentences_detailed.csv` there are other fields after the text, so in the second line the username was assigned to the variable for the second field, and that resulted in a non-existent directory path which broke the script.
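A toy reproduction of the `sentences_detailed.csv` case (the field layout matches the real file, but the awk one-liner is only an illustration of the failure mode, not our actual export script):

```sh
# The stray second line makes awk see the username where the language should
# be, which is what produced the /var/tmp/per_language/pelwan/ path above.
printf '8649740\teng\t817/5000\\\nAchieve easy success.\tpelwan\t2020-04-01 04:28:01\n' |
  awk -F'\t' '{ print "lang=[" $2 "]" }'
# lang=[eng]
# lang=[pelwan]
```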
> but the export script needs to be fixed
Do you have anything specific in mind? Better error reporting? Better error recovery?
> Do you have anything specific in mind? Better error reporting? Better error recovery?
We can make it so that crontab output is sent to sysadmins@tatoeba.org.
We can change the way mysql writes linefeed characters, or change how the awk command parses the file.
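For the first option, a sketch (the script path and schedule are just examples): cron mails a job's output to the address in `MAILTO`, assuming a working mail setup on the server:

```
# In the crontab that runs the exports:
MAILTO=sysadmins@tatoeba.org
# Any stdout/stderr from the job, e.g. the awk "fatal" error above, gets mailed.
0 3 * * 6 /home/tatoeba/bin/export_weekly_dumps.sh
```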
This file had the same problem this week: sentences_detailed.tar.bz2
@ckjpn Sorry about that! I manually fixed the exported files again. Could you fix sentence 8649740 before next Saturday?
I found a way to avoid linefeeds in the CSV files: `mysql --batch`. It replaces linefeeds with `\n` instead of `\` + an actual linefeed. Compare CSV export:

Using mysql’s `INTO OUTFILE` (what we do now):
```
vagrant@stretch:~/Tatoeba$ echo "select concat(" $(for i in $(seq 0 127); do echo -n "'$i:',CHAR($i),"; done) "'') as test INTO OUTFILE '/var/tmp/test.csv';" | sudo mysql tatoeba
vagrant@stretch:~/Tatoeba$ cat /var/tmp/test.csv
0:\01:2:3:4:5:6:7:89:\ 10:\
11:
12:
14:15:16:17:18:19:20:21:22:23:24:25:26:�27:8:29:30:31:32: 33:!34:"35:#36:$37:%38:&39:'40:(41:)42:*43:+44:,45:-46:.47:/48:049:150:251:352:453:554:655:756:857:958::59:;60:<61:=62:>63:?64:@65:A66:B67:C68:D69:E70:F71:G72:H73:I74:J75:K76:L77:M78:N79:O80:P81:Q82:R83:S84:T85:U86:V87:W88:X89:Y90:Z91:[92:\\93:]94:^95:_96:`97:a98:b99:c100:d101:e102:f103:g104:h105:i106:j107:k108:l109:m110:n111:o112:p113:q114:r115:s116:t117:u118:v119:w120:x121:y122:z123:{124:|125:}126:~127:
```
Using `mysql --batch`:
```
vagrant@stretch:~/Tatoeba$ echo "select concat(" $(for i in $(seq 0 127); do echo -n "'$i:',CHAR($i),"; done) "'') as test" | mysql tatoeba --batch
test
0:\01:2:3:4:5:6:7:89:\t10:\n11:
12:
14:15:16:17:18:19:20:21:22:23:24:25:26:�27:8:29:30:31:32: 33:!34:"35:#36:$37:%38:&39:'40:(41:)42:*43:+44:,45:-46:.47:/48:049:150:251:352:453:554:655:756:857:958::59:;60:<61:=62:>63:?64:@65:A66:B67:C68:D69:E70:F71:G72:H73:I74:J75:K76:L77:M78:N79:O80:P81:Q82:R83:S84:T85:U86:V87:W88:X89:Y90:Z91:[92:\\93:]94:^95:_96:`97:a98:b99:c100:d101:e102:f103:g104:h105:i106:j107:k108:l109:m110:n111:o112:p113:q114:r115:s116:t117:u118:v119:w120:x121:y122:z123:{124:|125:}126:~127:
```
In addition, `mysql --batch` converts tabs into `\t`, and it doesn’t require root! :rocket:
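For consumers, the `\n`/`\t`/`\0`/`\\` escapes decode with a single left-to-right scan (a sketch, not part of our tooling; naive successive replacements would mishandle a literal backslash followed by an `n`):

```sh
# Decode mysql --batch escapes in a field: \n -> LF, \t -> TAB, \0 -> NUL,
# \\ -> backslash (any other escaped character stands for itself).
printf 'line one\\nline two\\tend\n' |
  perl -pe 's/\\(.)/$1 eq "n" ? "\n" : $1 eq "t" ? "\t" : $1 eq "0" ? "\0" : $1/ge'
```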
I may be misunderstanding this, but are you suggesting that we allow `\n` in a sentence's text field?
If so, I have a feeling that this would be undesirable.
@jiru Nice!
> I may be misunderstanding this, but are you suggesting that we allow `\n` in a sentence's text field?
No, we don't allow a linefeed (U+000A) in a sentence text. But there was a bug where we missed one, and it broke the export script. This should be fixed.

But there are still other places where we allow them (wall posts, sentence comments, user language descriptions, ...), and for these fields embedded linefeeds should be properly escaped in the exports.
I always have to modify the exported comments file to get each record on one line. I do this by replacing whatever the export uses for linefeeds with the HTML `<br />` tag. This way, I can easily grab every line that contains one of my own comments.
Is this the kind of thing you are talking about?
Yes, these exports should be easier to handle in the future.
@jiru I'm glad you found a fix to this multi-line entry issue. But at least one pain point remains when it comes to reading the weekly dump files: you get an extra element when you split a row if there is a tab in a field. It's rare, but it happens, e.g. the name of the well-known 907 list created by CK starts with a tab.

```
"907 CK 2012-02-04 07:48:39 2020-05-20 03:57:18 \ ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences creator"
```

becomes

```
['907', 'CK', '2012-02-04 07:48:39', '2020-05-20 03:57:18', '\\', ' ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences', 'creator']
```
In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.
A solution may be to enclose fields in double quotation marks, as seen at https://www.mysqltutorial.org/mysql-export-table-to-csv/
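That would look something like this (a sketch with an example output path; `INTO OUTFILE` needs the `FILE` privilege): enclosing every field in quotes and doubling embedded quotes gives RFC-4180-style CSV that standard parsers handle:

```sh
mysql tatoeba -e "SELECT id, name FROM sentences_lists
  INTO OUTFILE '/var/tmp/lists_quoted.csv'
  FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\"'
  LINES TERMINATED BY '\n';"
```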
> @jiru I'm glad you found a fix to this multi-line entry issue. But at least one pain point remains when it comes to reading the weekly dump files: you get an extra element when you split a row if there is a tab in a field. It's rare, but it happens, e.g. the name of the well-known 907 list created by CK starts with a tab.
IMHO a list name shouldn't contain a tab, so this is a bug: we should properly clean the list name before storing it in the database. I've just created #2371.
> In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.
The data in that file isn't from MariaDB. It is an edited version of Manticore's log file (our search engine), and the original doesn't contain any delimiters (see Plain log format). So for next time (not sure whether @jiru plans to update that file in the future) I guess we should use another delimiter.
> But at least one pain point remains when it comes to reading the weekly dump files: you get an extra element when you split a row if there is a tab in a field.
This problem will also be solved by `mysql --batch`, as tabs are converted to `\t`:
```
$ mysql --batch tatoeba -e "select id, name from sentences_lists where id = 907"
id name
907 \t ! #01 - Proofread Good English Sentences That CK Uses on His Projects - List 907 - Over 750,000 Sentences
```
> In the case of the 'queries.csv' file, a simple comma is used as field delimiter, which makes it even more problematic because many queries contain commas.

> The data in that file isn't from MariaDB. It is an edited version of Manticore's log file (our search engine), and the original doesn't contain any delimiters (see Plain log format).
Yeah I confirm that was a quick and dirty extract from the log files. Nothing like a proper export. :wink:
And `mysql --batch` outputs the string `NULL` for null values, as opposed to `\N` in our current exports. Along with the new `\n` and `\t`, that will certainly cause some trouble for people making use of our exports. Any idea on how to make the transition smoother?
The `IFNULL()` function may help you replace the `NULL` values with `\N`. I found this blog post that deals with the very topic.
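A sketch of that idea (the table and column are examples only; note that `--batch` escapes the backslash itself, so a literal `\N` value comes out as `\\N`):

```sh
mysql tatoeba --batch -e \
  "SELECT id, IFNULL(script, '\\\\N') AS script FROM sentences LIMIT 3;"
```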
Good point! :blush:
There may be other problems, but here is the one in `sentences_detailed.csv`:

```
8649740 eng 817/5000\
Achieve easy success. pelwan 2020-04-01 04:28:01 2020-04-01 04:28:01
```
The scripts I use for processing the weekly files wouldn't work because of this.