fpjohnston / TECO-64

Enhanced and portable version of TECO text editor in C.

Search and replace UTF-8 #14

Closed LdBeth closed 11 months ago

LdBeth commented 11 months ago

I'm trying to replace some UTF-8 characters in a file using the FS command (save the text below to a file and run it with EI):

0J<@FS/ᴇ/E/;>
^[^[

This works with TECOC, since I guess that is 8-bit clean. However, it fails with TECO-64. Using ^Q to quote the character did not work either:

0J<@FS/^Qᴇ/E/;>
^[^[

I also tried putting ^Q before every byte of the character, but had no luck. Inserting the bytes with 225i180i135i works, so I'm surprised that searching for UTF-8 does not.

Is this a limitation of TECO-64, or is there a way to work around it?

fpjohnston commented 11 months ago

My immediate answer is that I'm embarrassed that TECO C can do something that TECO-64 can't, so I will certainly look into it. I am not sure why it is behaving this way, but it was not an intentional limitation. I am currently preparing another version for release, and I will endeavor to include a fix for you.

Thanks for bringing this to my attention.

fpjohnston commented 11 months ago

I expect to have a new version available tomorrow, once I finish some other changes. But Unicode characters are now displayed and echoed as I think they should be:

teco -n foo
Editing file: foo
v``
abcdᴇghij
fsᴇ`E
*v
abcdEghij

If you are curious, it wasn't so much that TECO C was doing anything special, nor that I had broken anything. Rather, I had provided backwards-compatibility for TECO-32's handling of 8-bit characters on VMS, and hadn't realized how that might affect users in other (and more modern) operating environments. (And to be honest, I hadn't anticipated that anyone might use TECO with UTF-8, so I never thought to test with it.)

Also, the way this will work is that there is a new bit for the E3 flag, which is enabled by default for non-VMS builds. Which reminds me that I should probably update the documentation for that.

LdBeth commented 11 months ago

Thanks for the explanation. I’m pumped up for the next release!

I don’t use TECO as my daily editor, since I need to handle many files that are not ASCII-encoded. I do enjoy using TECO as a terse scripting language. Your TECO-64 has definitely made programming more convenient.

fpjohnston commented 11 months ago

Version 200.36.1 has been released. I will close out this issue once you have confirmed that it has been resolved.

Please note that the change I made does not affect anything in display mode, which uses ncurses to handle output, and therefore would require different, and quite possibly much more extensive, modifications.

LdBeth commented 11 months ago

Ah, I can confirm the visual display is now working, but the fs replace does not work.

Editing file: test
*fsᴇ`E``
?SRH   Search failure: '<^!><^?><^?>'
*e3&64=``
64
*E3=``
323
*ht``
abcdᴇghij
*fsᴇ`E``
?SRH   Search failure: '<^!><^?><^?>'
*0j``
*fsᴇ`E``
?SRH   Search failure: '<^!><^?><^?>'
*fsa`b``
*ht``
bbcdᴇghij
*

LdBeth commented 11 months ago

Btw I did the test on OS X, I’m going to try on Linux later today.

fpjohnston commented 11 months ago

Strange. The FS command worked for me, as in the following macro:

@I/abcdᴇfghij
/
0J
@^A/before: / HT
< @FS/ᴇ/E/; >
@^A/after:  / HT

Which prints out:

before: abcdᴇfghij
after:  abcdEfghij

fpjohnston commented 11 months ago

By the way, I had intended for TECO-64 to work on OS X, and had done some work toward porting it when I had access to a MacBook at my last job, but then Covid happened and my company had to downsize, so I don't presently have any way to test in that environment.

LdBeth commented 11 months ago

It does work on Linux for me. It could be that I didn’t do a clean before rebuilding on OS X. I’ll retry on OS X and report back.

LdBeth commented 11 months ago

So I did some experiments and found that match_str behaves differently on Linux and OS X.

First I patched src/search.c to get a trace:

--- search.c.old    2023-06-03 07:26:45.000000000 -0500
+++ search.c    2023-06-04 15:38:30.000000000 -0500
@@ -269,6 +269,7 @@

     int match = *s->match_buf++;

+    tprint("match: %d\n", match);
     if (match == CTRL_E)
     {
         if (s->match_len-- == 0)
@@ -378,6 +379,7 @@
         {
             int c = read_edit(s->text_pos++);

+            tprint("c: %d\n", c);
             if (c == EOF || !match_chr(c, s))
             {
                 return false;

Then I ran this test file on both Linux and OS X:

@I/aᴇf
/
0J
@^A/before: / HT
@FS/ᴇ/E/
@^A/after:  / HT

On Linux the output is:

*eitest``
before: aᴇf
c: 97
match: -31
c: 225
match: -31
c: 180
match: -76
c: 135
match: -121
after:  aEf

On OS X:

eitest``
before: aᴇf
c: 97
match: -31
c: 225
match: -31
c: 180
match: -31
c: 135
match: -31
c: 102
match: -31
c: 10
match: -31
?SRH   Search failure: '<^!><^?><^?>'

Now, I don't know how to interpret the negative integers in match; I only know that c is the buffer content. I hope this helps track down the cause of the difference.

fpjohnston commented 11 months ago

I don't have a complete answer for you, but what I can say is that the -31, -76, and -121 are the result of sign-extending the 8-bit values that make up the UTF-8 encoding of the character. In unsigned decimal, they would be 225, 180, and 135, respectively.

What I'm guessing is that you've tripped across a difference in either processor architecture or compiler options between your Linux and Mac systems, such that a plain char isn't treated identically in both environments when it is negative.

I thought that I had specified that char was to be unsigned by default, but I'm obviously misremembering that, or perhaps there was a good reason for it being signed by default that I've forgotten. I think anything in the edit buffer should certainly be positive, as it otherwise creates confusion when trying to debug, as we have both discovered.

In any case, I will continue to investigate.

fpjohnston commented 11 months ago

Okay, I have a test I'd like you to run. Please change line 46 of Makefile to read as follows:

CFLAGS = -c -std=gnu11 -Wall -Wextra -Wno-unused-parameter -fshort-enums -funsigned-char -MMD

This will change a plain char to be unsigned.

Then rebuild on OS X, and let me know if it makes any difference to the result.

You may retain the tprint() statements for the time being.

Thanks.

fpjohnston commented 11 months ago

This shouldn't break any existing commands, so I will probably leave it in regardless. I re-ran my entire test suite, and nothing failed.

LdBeth commented 11 months ago

Yes it works now! I think the issue may be closed now. Thank you very much for the help.

fpjohnston commented 11 months ago

You're welcome. For what it's worth, I noted that although the edit buffer has type uchar, the match buffer is just char. That may be relevant for your problem. In any case, I will change the match buffer in addition to adding the -funsigned-char flag.

fpjohnston commented 11 months ago

I have not posted a new release yet because I had some progress in getting Unicode characters to display correctly in display mode, and I thought I'd see how far I could go with that. But since you had a workaround, I didn't think you needed anything else just yet. Feel free to let me know if that's not the case.

For what it's worth, I had originally wanted to use unsigned char (or its equivalent) throughout my code, but I ran into issues early on with lint and compiler warnings when I would use standard library functions that used char. I was reminded of this recently when I tried to use unsigned char throughout my code.

Another reason I'm holding off on a new version, though, is to make sure there isn't any issue with the use of the -funsigned-char compiler option. I only say that because I've known of that option for years, so I'm wondering whether I had a good reason for avoiding it. So far all of the various builds seem to work okay, and all of my smoke tests are passing, so I'm willing to accept that I probably just glossed over it.

In any event, I expect that I'll upload whatever I have by this weekend.

fpjohnston commented 11 months ago

Version 200.36.2 has been posted. I have decided not to try to fix the display of UTF-8 character sequences, as it would involve major changes to TECO, which historically always treated bytes and characters as synonymous. I'm sure it could be done, but I just don't see any reason to embark on such a huge effort right now.

Thanks again for your assistance with this.

fpjohnston commented 11 months ago

This issue is now closed.