ksh93: random behaviour of `read -n <nchar>` for multi-byte characters.

att / ast

AST - AT&T Software Technology

Eclipse Public License 1.0

560 stars 152 forks source link

ksh93: random behaviour of `read -n <nchar>` for multi-byte characters. #22

Open stephane-chazelas opened 8 years ago

stephane-chazelas commented 8 years ago

Reproduced with version sh (AT&T Research) 93u+ 2012-08-01 and version sh (AT&T Research) 93v- 2014-12-24 on Debian GNU/Linux amd64:

According to the man page read -n reads a number of bytes, while read --help says characters.

Tests are inconsistent: here testing in a UTF-8 locale with the 3-byte € (EURO U+20AC) character:

$ ksh -c 'for ((i=1;i<=6;i++)); do echo €€€€€€€€ | IFS= read -rn "$i" a; printf "$i %q\n" "$a"; done'
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]'

The 1 case suggests a number of characters, the 3 case a number of bytes, the rest doesn't seem to make any sense.

read -N doesn't have the issue (and seems to take a number of characters):

$ ksh -c 'for ((i=1;i<=6;i++)); do echo €€€€€€€€ | IFS= read -rN "$i" a; printf "$i %q\n" "$a"; done'
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]\u[20ac]\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'

zakukai commented 7 years ago

I seem to be getting some rather odd behavior here. I wanted to turn this example into a test I could use on different shells. Bash posed a problem because "shopt lastpipe" was not set. So I rearranged the test to take input from a "here string":

_file read-n_bytes_orcharacters.sh

for ((i=1;i<=6;i++)); do
    IFS= read -rn "$i" a <<<'€€€€€€€€'
    printf "$i %q\n" "$a"
done

In this case, ksh seems to get it right:

$ ksh read-n_bytes_or_characters.sh
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]\u[20ac]\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'

But if I switch it back to "echo '€€€€€€€€' | read ..." then it fails again. The "here string" feature is implemented using an unlinked temporary file, while the pipeline is (of course) a socket pair in ksh. It seems to go wrong when the input is a socket, tty, or pipe, but it does OK when the input is a file.

stephane-chazelas commented 7 years ago

2017-05-22 09:46:36 -0700, zakukai: [...]

It seems to go wrong when the input is a socket, tty, or pipe, but it does OK when the input is a file. [...]

The fact that we get a different code path when the input goes from a regular file doesn't surprise me.

When getting input from a seekable file (like a regular file and some device files), it's easier. To read the requested amount of characters, ksh93 can read a large block, find out how many bytes those characters occupy and seek back after that.

While for a non-seekable file like a pipe/socket/tty device, it needs to read one byte at a time to make sure it doesn't read past the requested amount of characters.

I suppose that's when it gets confused.

-- Stephane

zakukai commented 7 years ago

I haven't quite nailed it down yet but it looks like it comes down to this line

ast/src/cmd/ksh93/bltins/read.c : 649
if((size -= x) > 0 && (up >= cur || z < 0) && ((flags & NN_FLAG) || z < 0 || m > c))
    continue;   // Otherwise, end the read loop

Basically, when using -n (that's N_FLAG, as opposed to -N which is NN_FLAG), the code loops: On each iteration there are (x) characters remaining, so the implementation reads (x) bytes (since each character must take at least one byte). Then the newly-completed multi-byte characters are counted (starting from the pointer (up) and extending to the pointer (cur)) to determine how many characters will be left on the next iteration. Usually there will be some number of extra bytes read in that are carried over to the next iteration. But when the end of the last read coincides with a character boundary, it hits an edge case: (up == cur): the end of the last complete character read coincides with the last byte read from the stream (z > 0): The call to mbsize(up) on line 644 returned nonzero, because *up = 0 (an end-of-string marker set on (cur) to prevent reading beyond the buffered data) - and 0 is a single-byte character in UTF-8. So the back half of the "if" condition ((flags & NN_FLAG) || z < 0 || m > c) fails on this edge case, because the NN flag isn't set and z>0 (I don't really understand (m > c) yet..)

I'll try to work out the parts I'm not quite understanding yet and put a patch together.

siteshwar commented 6 years ago

Ksh uses sockets instead of real pipes to implement pipes in shell. This has caused issues on multiple occasions. For e.g.

$ cat /etc/passwd | head -1 /dev/stdin
head: cannot open '/dev/stdin' for reading: No such device or address

I tried removing this code. It fixes above issue, but it caused some of the io tests to fail. A number of failing tests seems to be incorrect. For e.g. consider this test. It expects different outputs on different locales. This command:

$ (print -n foo; sleep 1; print -n bar) |read -n6 temp; echo $temp

should always output foobar. With current development version, it gives foo. And the output may change with locale. Switching to real pipes fixes this issue, however there are going to be other regressions if we switch to real pipes. We should do a deeper analysis of how this is going to affect scripts.

Regarding fix for this bug, you can see my experiments in this branch.

DavidMorano commented 6 years ago

Of course some of you remember why KSH had to use sockets on some platforms: KSH needs to occasionally read only up to a new-line character and so it needs to PEEK into the input byte stream. Although older SysV type UNIXi allow for PEEKs on pipes, many newer systems (Linux?) do not allow for that (PEEK's on pipes). So KSH has to resort to sockets to get the PEEK capability on platforms that do not support PEEK for pipes. What KSH does now-a-days, exactly, on each platform, I do not know. I would like to think that it only resorts to sockets for shell "pipes" when needed, but I do not know if this is the case any longer (it might be using sockets on all platforms now, for all I know).

krader1961 commented 5 years ago

See also issue #1186 which is almost certainly the same problem as this issue.