Status Checkup - Githubissues

clayrisser commented 7 years ago

This project is sooo historically cool. I would love to know the status of the project. I haven't seen any activity for several months. I'm also will to contribute if you need more manpower.

WardCunningham commented 7 years ago

Most of my attention goes into my new wiki, federated wiki, which is well positioned to be historically cool in a few decades. See http://about.fed.wiki.

Still, thanks for asking. I continue to pay co-location fees because I don't want to lose the few hundred pages that I wasn't able to recover mostly due to mixed character encoding problems.

I'm not looking for advice on how to become a better programmer. But I would appreciate some help. I could put together some tar files with troublesome pages and various backups. If you or anyone else had a good approach for converting these to utf-8 I'd love to see this work done.

clayrisser commented 7 years ago

What encoding are the troublesome files in?

clayrisser commented 7 years ago

So, is http://wiki.c2.com/ going to be permanently frozen, or are there plans on opening it up again?

clayrisser commented 7 years ago

I read through the explanations of the Federation wiki, but it's pretty dense, and I don't fully understand its purpose.

WardCunningham commented 7 years ago

The troublesome files are in mixed encoding, having been edited by a variety of browser at a time where utf-8 was uncommon. Federated wiki offers an alternative (and editable) view into historic wiki pages. This javascript is more faithful to the original perl code. Wiki is more a medium, like paper, than a tool with a purpose, like a stapler. Federated wiki is a medium for doing work as well as talking about doing work. I do have trouble devoting energy here in the past but I would be glad to work on it with others.

maxlybbert commented 5 years ago

If you’re still interested in fixing the “troublesome files,” it sounds like an interesting problem. I’m not aware of any existing tool to autodetect the encoding of a part of a file, but I’m optimistic. Would you have time to get a few samples?

WardCunningham commented 5 years ago

I will prepare a sampling of troublesome pages and post a link to a tar file here. This repo has the ruby and c programs I used to convert most pages to json. The c program, json.c => a.out, converts troublesome characters to something that can be recognized by the ruby program, json.rb. The one character I had to convert to get anything working was the ASCII GS (group separator) character that I had used in my original perl code to separate groups. I suspect a large number of troublesome files can be handled by adding more cases to json.c. But what are the cases? That is the question that slowed down my conversion.

Here I resort again to perl to count byte occurrences in a known good file.

cat wiki.wdb/WardCunningham | \
  perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_}}' | \
  sort | uniq -c

Where for this file I get these counts:

Below 040 are ASCII control codes. Here we see TAB, LF and CR. Above 177 are 8-bit codes, 7-bits plus the high bit, 200 octal. I see here that I'm using octal code 263 as group separator. I vaguely remember switching to this unlikely code but don't remember why.

WardCunningham commented 5 years ago

I've put together a list of troublesome pages.

http://c2.com/wiki/remodel/trouble.txt

I'm also serving these in their original (not html) format which would feed into the json.c and json.rb scripts in this repo. I've picked one largish example from this list for character distribution analysis.

http://wiki.c2.com/?XpAsTroubleDetector

Here I run the perl script from above on this using the original format file as input:

curl -s http://c2.com/wiki/remodel/trouble/XpAsTroubleDetector | \
  perl -e 'while(read STDIN,$text,1024){@bytes=unpack"C*",$text;for(@bytes){printf"%03o\n",$_}}' | \
  sort | uniq -c

Where for this input I get these counts:

It's possible that this is a particularly tough case. Some sort of systematic study is in order ranking troublesome page names by, say, the number of unexpected character codes.

To aid in such a study I have assembled all troublesome files in one compressed tar file.

http://c2.com/wiki/remodel/trouble.tgz

I would be pleased to see some progress on any substantial number of these file.

WardCunningham commented 5 years ago

Some progress in this pull request: #32

maxlybbert commented 5 years ago

I’m sorry I didn’t look at this over the weekend. Even so, you’ve made a lot of progress pretty quickly. Hopefully I’ll be able to do something helpful before you’ve solved the problem.

WardCunningham commented 5 years ago

The thing I had missed was the Chinese spam. Most often it had been reverted but since I kept a copy of the last version in the same file the characters there killed my ruby program. The other insight I was missing was that I had line oriented files and could narrow my problem characters down to one line. Still plenty of random characters from pre-utf-8 character encodings.

maxlybbert commented 5 years ago

I've played around with Perl's Encode::Guess module, and the early results are promising. I used the following script, and most of the non-utf8 portions are in the Windows' version of Latin1. And most of the exceptions are the Chinese spam:

#!/bin/env perl

# improved Unicode support starting with 5.14
use v5.14;
use warnings;

use constant codepages => qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};

use Encode;
use Encode::Guess codepages;
use List::Util 'all';

Encode::Guess->set_suspects(codepages);

binmode STDOUT, ':utf8';

for my $filename (map { glob } @ARGV) {
    my $fh;
    if (!open $fh, '<', $filename) {
        warn "cannot open $filename: $!\n";
        next;
    }
    while (my $line = <$fh>) {
        chomp $line;
        # Even though utf8 is defined so that 7-bit ASCII is valid utf8,
        # utf8::is_utf8 returns false when given a string of just
        # 7-bit ASCII.  So test for 7-bit ASCII separately (and assume
        # it's encoded correctly).
        if (all { ord($_) < 128 } split //, $line) {
            next;
        }
        # It is possible to get a false negative (e.g., Latin1 text
        # which happens to have all characters with values above 127
        # followed by characters with values of 127 or less), but
        # it's very unlikely.
        if (utf8::is_utf8($line)) {
            next;
        }

        my $enc = guess_encoding($line);
        if (!defined $enc) {
            warn "cannot guess encoding for $line\n";
            next;
        }
        if (ref $enc) {
            say "$filename:$. (" . $enc->name . ")\t"
                . $enc->decode($line);
            next;
        }
        for (split /\s+or\s+/, $enc) {
            say "$filename:$. ($_)\t" . Encode::decode($_, $line);
        }
    }
}

I believe it wouldn't be hard to write a script to filter out the spam and correct the encodings (convert to Windows Latin1 by default, but mark specific files that need a different conversion). I'll do that next.

maxlybbert commented 5 years ago

Oops. utf8::is_utf8 doesn’t do what I thought. The script should be:

#!/bin/env perl

# improved Unicode support starting with 5.14
use v5.14;
use warnings;

use constant codepages => qw{WinLatin1 latin1 euc-jp shiftjis 7bit-jis};

use Encode;
use Encode::Guess codepages;

Encode::Guess->set_suspects(codepages);

binmode STDOUT, ':utf8';

for my $filename (map { glob } @ARGV) {
    my $fh;
    if (!open $fh, '<', $filename) {
        warn "cannot open $filename: $!\n";
        next;
    }
    while (my $line = <$fh>) {
        chomp $line;
        my $copy = $line;
        if (utf8::decode($copy)) {
            next;
        }

        my $enc = guess_encoding($line);
        if (!defined $enc) {
            warn "cannot guess encoding for $line\n";
            next;
        }
        if (ref $enc) {
            say "$filename:$. (" . $enc->name . ")\t"
                . $enc->decode($line);
            next;
        }
        for (split /\s+or\s+/, $enc) {
            say "$filename:$. ($_)\t" . Encode::decode($_, $line);
        }
    }
}

WardCunningham commented 5 years ago

This is an amazingly helpful script. I thought it might be possible but didn't know enough about encoding to even begin.

I added a substitution for the $SEP character I used in my serializations. I know it won't collide with any other alphabet because I removed them from submitted text on save before I serialize.

my $SEP = "\263";

Can I assume that the result of $enc->decode($line); is utf-8? If so, it seems like I have all of the pieces I need to convert 99% of my files.

Aside: Wikipedia has been helpful explaining each of the encodings suggested by your script.

maxlybbert commented 5 years ago

I re-read the documentation to be sure about whether $enc->encode($line) always returns utf-8. It does, with the caveat that $enc can be either an object that can convert to utf-8 or an error message. I got that wrong: I thought it was a list of candidates, which is I have the split. I really got that wrong, because presumably there is some text before the first encoding name, and I don’t strip it out. But I already have a list of the code pages I asked for, so there’s no need to try to figure out that list from $enc.

As it currently exists, the script has serious problems. But I am glad that it provides a decent starting point for an actual conversion script.

maxlybbert commented 5 years ago

I checked what $enc has on error, and it does get an “or”-separated list of candidates. Which is nice, since Encode::Guess figures out the encoding only 127 times, compared to 23,000 times where more than one code page could be right.

maxlybbert commented 5 years ago

I recently discovered that ICU ( http://icu-project.org ) supports encoding detection, so I wrote a short C++ program that detects the encoding, line-by-line, and actually performs the encoding. Unfortunately, some encodings that ICU detects aren't properly set up on my computer (e.g., IBM424_rtl and IBM424_rtl), so actually trying those encodings fails when I run my program. Those encodings seem to show up mainly in spam links, so getting them properly decoded may not be such a big issue. It so happens that falling back to reading that text as UTF-8 gives me mojibake, but doesn't throw an error. You may have better luck on a different computer.

Github won't allow me to attach a tarball of the processed files. I would be happy to email it to you, or send it some other way. I have attached the C++ program (as a .txt, because Github won't accept it with a .cc extension). It's not an efficient program (it uses functions that ICU refers to as inefficient convenience functions), but it runs fast enough for me. icudet.txt

maxlybbert commented 5 years ago

I have some changes I want to make to my C++ program. I think I’m wrong about getting mojibake when I fall back to encoding by UTF-8. Instead, I think I’m getting “invalid conversion” characters.

I won’t be able to fix the program until tonight at the earliest. If you want to make the changes: I plan to ask ICU to give me a list of candidates (instead of just the best candidate) and exhaust those before I fall back to just trying everything, plus I plan to change the check for whether something was successfully decoded.

WardCunningham commented 5 years ago

Thank you for your continued effort here.

There are often two copies of a page in each file. If the spam associated encodings are in one version only that would indicate a preference for the other. This, can ruby read it, was my first discrimination between copies and seemed to handle a lot of cases. This might be asking a lot of your program unless it is already unpacking the parts.

maxlybbert commented 5 years ago

I’m currently only going line-by-line. I don’t think it would be hard to process just the de-spammed portion of each file, though.

marnen commented 4 years ago

I would very much like to help get the remaining wiki pages operational, but the tarball is hosted on c2.com, which now seems to be down. Can we make a GitHub repo with the remaining page content, and use the pull request workflow to facilitate the cleanup?

WardCunningham / remodeling

Status Checkup #24