clayrisser opened this issue 7 years ago
Most of my attention goes into my new wiki, federated wiki, which is well positioned to be historically cool in a few decades. See http://about.fed.wiki.
Still, thanks for asking. I continue to pay co-location fees because I don't want to lose the few hundred pages that I wasn't able to recover, mostly due to mixed character encoding problems.
I'm not looking for advice on how to become a better programmer. But I would appreciate some help. I could put together some tar files with troublesome pages and various backups. If you or anyone else had a good approach for converting these to utf-8 I'd love to see this work done.
What encoding are the troublesome files in?
So, is http://wiki.c2.com/ going to be permanently frozen, or are there plans on opening it up again?
I read through the explanations of the Federated wiki, but it's pretty dense, and I don't fully understand its purpose.
The troublesome files are in mixed encodings, having been edited by a variety of browsers at a time when utf-8 was uncommon. Federated wiki offers an alternative (and editable) view into historic wiki pages. This javascript is more faithful to the original perl code. Wiki is more a medium, like paper, than a tool with a purpose, like a stapler. Federated wiki is a medium for doing work as well as talking about doing work. I have had trouble devoting energy here in the past, but I would be glad to work on it with others.
If you’re still interested in fixing the “troublesome files,” it sounds like an interesting problem. I’m not aware of any existing tool to autodetect the encoding of a part of a file, but I’m optimistic. Would you have time to get a few samples?
I will prepare a sampling of troublesome pages and post a link to a tar file here. This repo has the ruby and c programs I used to convert most pages to json. The c program, json.c => a.out, converts troublesome characters to something that can be recognized by the ruby program, json.rb. The one character I had to convert to get anything working was the ASCII GS (group separator) character that I had used in my original perl code to separate groups. I suspect a large number of troublesome files can be handled by adding more cases to json.c. But what are the cases? That is the question that slowed down my conversion.
Here I resort again to perl to count byte occurrences in a known good file.
cat wiki.wdb/WardCunningham | \
perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_}}' | \
sort | uniq -c
Where for this file I get these counts:
8 011
975 012
975 015
10451 040
14 041
122 042
4 045
24 046
930 047
75 050
81 051
202 052
2 053
415 054
707 055
945 056
277 057
150 060
82 061
119 062
46 063
48 064
58 065
44 066
37 067
48 070
46 071
121 072
12 073
26 075
126 077
4 100
142 101
120 102
165 103
86 104
120 105
77 106
43 107
94 110
455 111
46 112
24 113
88 114
125 115
65 116
106 117
146 120
12 121
72 122
261 123
227 124
36 125
14 126
392 127
10 130
14 131
2 132
10 133
10 135
26 137
3852 141
751 142
1518 143
1855 144
5889 145
945 146
1259 147
2160 150
3900 151
70 152
712 153
2133 154
1397 155
3585 156
4340 157
1159 160
44 161
3226 162
3409 163
4668 164
1582 165
518 166
1091 167
198 170
887 171
41 172
2 176
19 263
Below 040 are ASCII control codes; here we see TAB, LF and CR. Above 177 are 8-bit codes: the 7-bit value plus the high bit, octal 200. I see here that I'm using octal code 263 as the group separator. I vaguely remember switching to this unlikely code but don't remember why.
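Narrowing the same pipeline to just the 8-bit codes makes this kind of scan quicker on other files; the only change is the if modifier on the printf:
cat wiki.wdb/WardCunningham | \
perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_ if $_>0177}}' | \
sort | uniq -c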
I've put together a list of troublesome pages.
http://c2.com/wiki/remodel/trouble.txt
I'm also serving these in their original (not html) format which would feed into the json.c and json.rb scripts in this repo. I've picked one largish example from this list for character distribution analysis.
http://wiki.c2.com/?XpAsTroubleDetector
Here I run the perl script from above on this page, using the original-format file as input:
curl -s http://c2.com/wiki/remodel/trouble/XpAsTroubleDetector | \
perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_}}' | \
sort | uniq -c
Where for this input I get these counts:
10 011
73 012
73 015
3627 040
1 041
42 042
139 047
12 050
12 051
49 054
32 055
7872 056
10335 057
2146 060
8515 061
2874 062
979 063
1234 064
3384 065
3407 066
1013 067
1115 070
888 071
2599 072
3 073
15 077
6 101
6 102
5 103
6 104
2 105
4 106
4 107
6 110
35 111
1 112
2 113
3 114
5 115
2 116
3 117
28 120
11 123
18 124
5 125
1 126
11 127
26 130
3 131
2588 133
2588 135
2907 141
61 142
198 143
2674 144
3123 145
108 146
2855 147
5547 150
5356 151
24 152
39 153
2790 154
2752 155
7891 156
522 157
2825 160
6 161
281 162
342 163
10810 164
154 165
36 166
252 167
9 170
2613 171
3 172
1 242
1 245
1 250
19 260
2 262
15 263
1 264
20 265
1 266
9 267
1 270
2 273
2 276
2 277
4 302
6 303
1 305
3 312
2 314
2 315
7 317
1 320
8 321
19 323
7 324
3 325
1 326
1 327
2 330
5 332
1 333
5 337
1 341
6 342
20 347
1 354
1 355
3 370
1 372
It's possible that this is a particularly tough case. Some sort of systematic study is in order ranking troublesome page names by, say, the number of unexpected character codes.
To aid in such a study I have assembled all troublesome files in one compressed tar file.
http://c2.com/wiki/remodel/trouble.tgz
I would be pleased to see some progress on any substantial number of these files.
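As a first cut at such a ranking, something along these lines (an untested sketch, and only one possible measure) would count the high-bit bytes in each file, skipping my octal 263 separator, and sort page names by that count. Run it over the unpacked trouble directory with a quoted glob as the argument.
#!/bin/env perl
# rough sketch: rank troublesome pages by the number of high-bit bytes,
# ignoring the octal 263 group separator
use v5.14;
use warnings;
my %count;
for my $filename (map { glob } @ARGV) {
    open my $fh, '<:raw', $filename or do {
        warn "cannot open $filename: $!\n";
        next;
    };
    local $/;                       # slurp the whole file as bytes
    my $text = <$fh>;
    # count bytes 0200-0377, skipping the 0263 separator
    $count{$filename} = () = $text =~ /[\200-\262\264-\377]/g;
}
say "$count{$_}\t$_" for sort { $count{$b} <=> $count{$a} } keys %count;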
Some progress in this pull request: #32
I’m sorry I didn’t look at this over the weekend. Even so, you’ve made a lot of progress pretty quickly. Hopefully I’ll be able to do something helpful before you’ve solved the problem.
The thing I had missed was the Chinese spam. Most often it had been reverted, but since I kept a copy of the last version in the same file, the characters there killed my ruby program. The other insight I was missing was that I had line-oriented files and could narrow my problem characters down to one line. Still, there are plenty of random characters from pre-utf-8 character encodings.
I've played around with Perl's Encode::Guess module, and the early results are promising. I used the following script; most of the non-utf8 portions are in the Windows version of Latin1, and most of the exceptions are the Chinese spam:
#!/bin/env perl
# improved Unicode support starting with 5.14
use v5.14;
use warnings;
use constant codepages => qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};
use Encode;
use Encode::Guess codepages;
use List::Util 'all';
Encode::Guess->set_suspects(codepages);
binmode STDOUT, ':utf8';
for my $filename (map { glob } @ARGV) {
    my $fh;
    if (!open $fh, '<', $filename) {
        warn "cannot open $filename: $!\n";
        next;
    }
    while (my $line = <$fh>) {
        chomp $line;
        # Even though utf8 is defined so that 7-bit ASCII is valid utf8,
        # utf8::is_utf8 returns false when given a string of just
        # 7-bit ASCII. So test for 7-bit ASCII separately (and assume
        # it's encoded correctly).
        if (all { ord($_) < 128 } split //, $line) {
            next;
        }
        # It is possible to get a false negative (e.g., Latin1 text
        # which happens to have all characters with values above 127
        # followed by characters with values of 127 or less), but
        # it's very unlikely.
        if (utf8::is_utf8($line)) {
            next;
        }
        my $enc = guess_encoding($line);
        if (!defined $enc) {
            warn "cannot guess encoding for $line\n";
            next;
        }
        if (ref $enc) {
            say "$filename:$. (" . $enc->name . ")\t"
                . $enc->decode($line);
            next;
        }
        for (split /\s+or\s+/, $enc) {
            say "$filename:$. ($_)\t" . Encode::decode($_, $line);
        }
    }
}
I believe it wouldn't be hard to write a script to filter out the spam and correct the encodings (convert to Windows Latin1 by default, but mark specific files that need a different conversion). I'll do that next.
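Roughly, what I have in mind is something like this untested sketch. Lines that are already valid utf-8 pass through unchanged; everything else is decoded from the Windows version of Latin1 (cp1252) unless the file is listed in an override table, and the converted text goes to standard output. The %override entries are made up for illustration, and the spam filtering isn't shown yet.
#!/bin/env perl
# untested sketch: re-encode pre-utf-8 lines as utf-8
use v5.14;
use warnings;
use Encode qw(decode encode);
# hypothetical per-file exceptions; the real list would come from the guesses above
my %override = (
    # 'SomeTroublesomePage' => 'shiftjis',
);
binmode STDOUT, ':raw';
for my $filename (map { glob } @ARGV) {
    open my $fh, '<:raw', $filename or do {
        warn "cannot open $filename: $!\n";
        next;
    };
    my $from = $override{$filename} // 'cp1252';    # Windows Latin-1 by default
    while (my $line = <$fh>) {
        my $copy = $line;
        if (utf8::decode($copy)) {    # already valid utf-8, keep as is
            print $line;
            next;
        }
        print encode('utf-8', decode($from, $line));
    }
}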
Oops. utf8::is_utf8 doesn't do what I thought. The script should be:
#!/bin/env perl
# improved Unicode support starting with 5.14
use v5.14;
use warnings;
use constant codepages => qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};
use Encode;
use Encode::Guess codepages;
Encode::Guess->set_suspects(codepages);
binmode STDOUT, ':utf8';
for my $filename (map { glob } @ARGV) {
    my $fh;
    if (!open $fh, '<', $filename) {
        warn "cannot open $filename: $!\n";
        next;
    }
    while (my $line = <$fh>) {
        chomp $line;
        my $copy = $line;
        if (utf8::decode($copy)) {
            next;
        }
        my $enc = guess_encoding($line);
        if (!defined $enc) {
            warn "cannot guess encoding for $line\n";
            next;
        }
        if (ref $enc) {
            say "$filename:$. (" . $enc->name . ")\t"
                . $enc->decode($line);
            next;
        }
        for (split /\s+or\s+/, $enc) {
            say "$filename:$. ($_)\t" . Encode::decode($_, $line);
        }
    }
}
This is an amazingly helpful script. I thought it might be possible but didn't know enough about encoding to even begin.
I added a substitution for the $SEP character I used in my serializations. I know it won't collide with any other alphabet because I removed that character from submitted text on save, before serializing.
my $SEP = "\263";
Can I assume that the result of $enc->decode($line) is utf-8? If so, it seems like I have all of the pieces I need to convert 99% of my files.
Aside: Wikipedia has been helpful explaining each of the encodings suggested by your script.
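Assuming that's right, the per-line conversion I have in mind looks roughly like this (an untested sketch; the suspect list is the one from your script, and rather than substituting $SEP I just split on it, since that character never appears in content):
# untested sketch: convert one serialized line to utf-8, field by field
use v5.14;
use warnings;
use Encode;
use Encode::Guess qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};
my $SEP = "\263";
sub to_utf8 {
    my ($line) = @_;
    my @fields = split /$SEP/, $line, -1;    # -1 keeps trailing empty fields
    for my $field (@fields) {
        my $copy = $field;
        next if utf8::decode($copy);         # field is already valid utf-8
        my $enc = guess_encoding($field);
        next unless ref $enc;                # ambiguous guess: leave it for later
        $field = Encode::encode('utf-8', $enc->decode($field));
    }
    return join $SEP, @fields;
}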
I re-read the documentation to be sure about whether $enc->decode($line) always returns utf-8. It does, with the caveat that $enc can be either an object that can convert to utf-8 or an error message. I got that wrong: I thought it was a list of candidates, which is why I have the split. I really got that wrong, because presumably there is some text before the first encoding name, and I don't strip it out. But I already have a list of the code pages I asked for, so there's no need to try to figure out that list from $enc.
As it currently exists, the script has serious problems. But I am glad that it provides a decent starting point for an actual conversion script.
I checked what $enc has on error, and it does get an "or"-separated list of candidates. Which is nice, since Encode::Guess figures out the encoding unambiguously only 127 times, compared to 23,000 times where more than one code page could be right.
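For reference, a tally like that can be made along these lines (a sketch, not the exact code I ran; it reads lines on standard input):
# sketch: count decisive vs. ambiguous guesses for non-utf-8 lines
use v5.14;
use warnings;
use Encode;
use Encode::Guess qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};
my ($decisive, $ambiguous) = (0, 0);
while (my $line = <>) {
    chomp $line;
    my $copy = $line;
    next if utf8::decode($copy);    # skip lines that are already valid utf-8
    my $enc = guess_encoding($line);
    next unless defined $enc;
    if (ref $enc) { $decisive++ }   # a single Encode object was returned
    else          { $ambiguous++ }  # an error string, usually "a or b" candidates
}
say "decisive: $decisive  ambiguous: $ambiguous";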
I recently discovered that ICU ( http://icu-project.org ) supports encoding detection, so I wrote a short C++ program that detects the encoding, line by line, and actually performs the conversion. Unfortunately, some encodings that ICU detects aren't properly set up on my computer (e.g., IBM424_rtl), so actually trying those encodings fails when I run my program. Those encodings seem to show up mainly in spam links, so getting them properly decoded may not be such a big issue. It so happens that falling back to reading that text as UTF-8 gives me mojibake, but doesn't throw an error. You may have better luck on a different computer.
GitHub won't allow me to attach a tarball of the processed files. I would be happy to email it to you, or send it some other way. I have attached the C++ program (as a .txt, because GitHub won't accept it with a .cc extension). It's not an efficient program (it uses functions that ICU refers to as inefficient convenience functions), but it runs fast enough for me. icudet.txt
I have some changes I want to make to my C++ program. I think I’m wrong about getting mojibake when I fall back to encoding by UTF-8. Instead, I think I’m getting “invalid conversion” characters.
I won’t be able to fix the program until tonight at the earliest. If you want to make the changes: I plan to ask ICU to give me a list of candidates (instead of just the best candidate) and exhaust those before I fall back to just trying everything, plus I plan to change the check for whether something was successfully decoded.
Thank you for your continued effort here.
There are often two copies of a page in each file. If the spam-associated encodings are in one version only, that would indicate a preference for the other. This test, whether ruby could read it, was my first way of discriminating between copies, and it seemed to handle a lot of cases. This might be asking a lot of your program unless it is already unpacking the parts.
I’m currently only going line-by-line. I don’t think it would be hard to process just the de-spammed portion of each file, though.
I would very much like to help get the remaining wiki pages operational, but the tarball is hosted on c2.com, which now seems to be down. Can we make a GitHub repo with the remaining page content, and use the pull request workflow to facilitate the cleanup?
This project is sooo historically cool. I would love to know the status of the project. I haven't seen any activity for several months. I'm also willing to contribute if you need more manpower.