Support UTF-8 identifiers in ctangle output.

ascherer commented 3 years ago

Feature request and patch by @igor-liferenko.

Although I can apply the patch to ctangle.w and get the desired effect with the +u option, gcc 9.3.0 on (K)Ubuntu 20.04 LTS can't cope with identifiers containing UTF-8 characters. It appears that UTF-8 support in identifiers comes with gcc 10. Plus, there's a significant amount of spit and polish to be applied in order to integrate that small patch into the code base (test, doc, etc.).

ascherer commented 3 years ago

As a first step, it might be possible to skip #c3 and use ecma94 to transliterate some umlauts.

ascherer commented 3 years ago

Actually, the second bytes of UTF-8 characters (after #c3) do not correspond to ecma94, so a different transliteration table would be necessary. For the “usual suspects” of the German language, the following modification works:

@x
    else C_printf("%s",translit[(unsigned char)(*j)-0200]);
@y
    else {
      if (flags['u']) C_putc(*j);
      else {
        if (0303==(unsigned char)(*j)) ++j;
        C_printf("%s",translit[(unsigned char)(*j)-0200]);
      }
    }
@z

(skip #c3 and transliterate the next byte) and the resulting uctangle processes the input file

@l 84 Ae
@l 96 Oe
@l 9c Ue
@l 9f ss
@l a4 ae
@l b6 oe
@l bc ue

@* Igor.

@c
int main(void)
{
    int fröhlicheWeihnacht = 42;
    int ätscheBÄH = 100;
    int ÄÖÜßäöü = 666;
    return 0;
}

@* Index.

into the expected output

/*1:*/
#line 11 "utest.w"

int main(void)
{
int froehlicheWeihnacht= 42;
int aetscheBAeH= 100;
int AeOeUessaeoeue= 666;
return 0;
}

/*:1*/
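
For readers without a patched ctangle at hand, the skip-and-transliterate step can be approximated in a standalone C sketch. Everything here is hypothetical scaffolding, not ctangle code: `lookup` stands in for the `translit` table filled by the `@l` lines above, `translit_utf8` is an invented helper name, and the `X` plus hex fallback for unmapped bytes is merely a placeholder chosen for this sketch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the translit table set up by the @l lines. */
static const char *lookup(unsigned char b)
{
    switch (b) {
    case 0x84: return "Ae";
    case 0x96: return "Oe";
    case 0x9c: return "Ue";
    case 0x9f: return "ss";
    case 0xa4: return "ae";
    case 0xb6: return "oe";
    case 0xbc: return "ue";
    default:   return NULL;   /* no mapping known in this sketch */
    }
}

/* Copy `in` to `out`; for a high byte, skip a 0xC3 lead byte (if present)
   and transliterate the byte that follows it. */
static void translit_utf8(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    if (outsz == 0) return;
    out[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)in; *p; ++p) {
        char buf[8];
        const char *piece;
        if (*p < 0x80) {                       /* plain ASCII: copy as is */
            buf[0] = (char)*p;
            buf[1] = '\0';
            piece = buf;
        } else {
            if (*p == 0xC3 && p[1] != '\0') ++p;   /* skip the lead byte */
            piece = lookup(*p);
            if (piece == NULL) {               /* placeholder for unmapped bytes */
                snprintf(buf, sizeof buf, "X%02X", (unsigned)*p);
                piece = buf;
            }
        }
        size_t len = strlen(piece);
        if (n + len >= outsz) break;           /* stop rather than overflow */
        memcpy(out + n, piece, len + 1);
        n += len;
    }
}
```

Fed the UTF-8 bytes of the first identifier from the test file, this sketch yields `froehlicheWeihnacht`, matching the tangled output shown above.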

I have absolutely no idea whether the above change breaks any legal CWEB input. A quick glance at the ISO-8859-1 table and some cross-calculation shows that the magic number #c2 might also come into play.
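
The cross-calculation can be spelled out in a few lines of C. This is only a sketch of the standard two-byte UTF-8 encoding applied to the upper half of ISO-8859-1; the helper name `latin1_to_utf8` is made up for illustration.

```c
#include <stdio.h>

/* Encode an ISO-8859-1 code point in the range 0x80..0xFF as two UTF-8 bytes:
   lead byte 110xxxxx carries the top bits, continuation byte 10xxxxxx the low six. */
static void latin1_to_utf8(unsigned cp, unsigned char out[2])
{
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 0xC2 for 0x80..0xBF, 0xC3 for 0xC0..0xFF */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* always in 0x80..0xBF */
}
```

For ä (0xE4) this gives #c3 #a4 and for ß (0xDF) it gives #c3 #9f, which agrees with the `@l` table above. Code points 0x80 to 0xBF get the lead byte #c2 instead, so, for example, ¤ (0xA4) becomes #c2 #a4 and shares its second byte with ä. A table indexed by the second byte alone cannot tell the two apart.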

ascherer commented 3 years ago

This issue is related to issue #8.

ascherer commented 2 years ago

I completely forgot about my own ideas expressed above. Only after watching both videos of @dylanbeattie's talk on “Plain Text” (NDC Oslo 2021, NDC Copenhagen 2022) did it click for me. As the (partial) fix for issue #8 makes use of the +u option in a different manner, I am closing this issue. CWEB is far too conservative to emit UTF-8. On my Linux box, gcc 9 can't grok UTF-8 identifiers anyway. (I have seen the future on my Mac Mini with Clang 13, though.)

ascherer / cwebbin

Support UTF-8 identifiers in ctangle output. #42