MightyPirates / OpenComputers

Home of the OpenComputers mod for Minecraft.
https://oc.cil.li

bugs: unicode.sub modifying strings -> buffer:read having trouble with non-ASCII characters #1207

Closed mpmxyz closed 4 years ago

mpmxyz commented 9 years ago

I got a bug report for my own program (http://oc.cil.li/index.php?/topic/511-crunch-break-the-4k-limit/#entry2317) and found out that the bug is not within my program: when it loads this file, http://pastebin.com/z9w0Bwij, it crashes because the last character of the file (a "]" closing a long string) is missing. While debugging I found that one character is instead split into two erroneous ones: http://i.imgur.com/LtpKkHM.png

I further boiled it down to an error within the buffer:read / readBytesOrChars function:

    local left = n - len(buffer)                     -- characters still needed
    buffer = buffer .. sub(self.bufferRead, 1, left) -- sub is unicode.sub here
    self.bufferRead = sub(self.bufferRead, left + 1) -- drop what was consumed

Here is a step-by-step example to show what's wrong:

    n = 3
    buffer = ""
    self.bufferRead = "\226\148\128\226\148"
    -->
    left = 3 - 0 = 3
    buffer = "" .. unicode.sub("\226\148\128\226\148", 1, 3) = "\226\148\128\195\162" -- where did "\195\162" come from?
    self.bufferRead = "" -- triggers a new reading operation
    -->
    self.bufferRead = "\128\226\148\128"
    left = 3 - 2 = 1 -- only one character missing...
    buffer = buffer .. unicode.sub("\128\226\148\128", 1, 1) = "\226\148\128\195\162" .. "\194\128" -- where did "\194" come from?

So the current bug is caused by unicode.sub adding garbage when it encounters incomplete characters. While fixing that, I'd also recommend checking whether the non-binary version of buffer:read works correctly when it receives only the first part of the last character of a chunk.
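For reference, the state above can be reproduced directly. A minimal sketch, assuming an OpenComputers machine where the unicode library is available:

    local s = "\226\148\128\226\148" -- U+2500 plus the first two bytes of another U+2500
    -- asking for the first three "characters" should never invent new bytes,
    -- but the sliced trailing sequence makes unicode.sub emit garbage:
    print((unicode.sub(s, 1, 3):gsub(".", function(c)
      return "\\" .. c:byte()
    end)))
    -- prints \226\148\128\195\162, the bytes observed above, instead of \226\148\128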

fnuecke commented 9 years ago

Hrm, it's kind of a matter of definition as to what the actual bug is. I'd actually say the issue is with the buffer's current implementation, as it kind of should check if the current byte buffer ends in a valid unicode char. I think it's OK to have unicode.sub and family expect a valid unicode string as input. So I suppose what's needed is some way to allow the buffer to detect up to where the buffer is valid... I have some ideas, but none of them are nice. Any suggestions on how to best tackle this are welcome.

SolraBizna commented 9 years ago

My implementation of the unicode library handles non-BMP code points and erroneous sequences well, but it's in C. The Lua architecture one converts from UTF-8 to UTF-16 and counts UTF-16 code units, which is the source of the questionable Astral Plane support (Java's fault) and the extra garbage from unicode.sub (which I'll wager is U+FFFD). I can't think of a reason my implementation couldn't be rewritten in Java/Scala and applied to byte arrays. That would avoid at least this particular problem.

As for determining whether a buffer ends in a valid sequence... UTF-8 is specifically designed to make this kind of thing easy. A one-byte sequence is always 0x00-0x7F; 0xC0-0xDF always introduces a two-byte sequence; 0xE0-0xEF always a three-byte sequence; 0xF0-0xF7 a four-byte sequence. 0x80-0xBF only occurs as the second, third, or fourth byte in a sequence, and no other byte does; if you get one, you have to look at most three bytes back to find a start-of-sequence. Five- and six-byte sequences were defined originally but are no longer allowed, so 0xF8-0xFF never occur at all. A Lua regular expression could almost handle this. Only up to four final bytes need to be checked to determine whether the last character is valid/complete.
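That check is small enough to sketch in plain Lua. This is illustrative only (it just encodes the byte ranges above), not an existing OpenComputers function:

    -- Given a byte string, return the length of its longest prefix that ends
    -- on a complete UTF-8 sequence. Illustrative sketch only.
    local function completeLength(s)
      local n = #s
      local i = n
      -- walk back over at most three trailing continuation bytes (0x80-0xBF)
      while i > 0 and i > n - 3 do
        local b = s:byte(i)
        if b < 0x80 or b > 0xBF then break end
        i = i - 1
      end
      if i == 0 then return 0 end -- nothing but continuation bytes
      local b = s:byte(i)
      local need -- expected length of the sequence starting at i
      if b <= 0x7F then need = 1
      elseif b >= 0xC0 and b <= 0xDF then need = 2
      elseif b >= 0xE0 and b <= 0xEF then need = 3
      elseif b >= 0xF0 and b <= 0xF7 then need = 4
      else return i - 1 end -- 0x80-0xBF or 0xF8-0xFF here is invalid; drop it
      if n - i + 1 >= need then return n end -- last sequence is complete
      return i - 1 -- keep everything up to the last complete character
    end

With the example from the original report, completeLength("\226\148\128\226\148") returns 3, so a buffer could hand the first three bytes to unicode.sub and carry the sliced tail over to the next read.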

payonel commented 7 years ago

I have decided that the buffer library can assume the underlying stream returns valid character sequences. When you use io.open to open a file in text mode (not binary), the underlying host filesystem returns valid UTF-8 sequences and does not slice them. The buffer library expects sequences to be complete, and I'm okay with this expectation.

If user code reads a custom stream containing UTF-8 sequences in text mode, then that stream must respect the sequence boundaries appropriately.

If you cannot make your stream obey, then you need to read the stream in binary mode.
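In OpenOS that just means passing the binary flag to io.open; a quick sketch (the path is only an example):

    -- "rb" opens the file in binary read mode: read() returns raw bytes
    -- and no character-boundary assumptions are made.
    local file = io.open("/home/data.bin", "rb")
    local chunk = file:read(5) -- five bytes, not five characters
    file:close()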

SolraBizna commented 7 years ago

> When you use io.open to open a file in text mode (not binary), the underlying host filesystem returns valid UTF-8 sequences and does not slice them.

Maybe that works on Windows, but unless Java is adding one of its own, POSIX systems do not have a separate "text mode".

payonel commented 7 years ago

@SolraBizna you make a good point. My reasoning is based on the filesystem component's open method allowing a binary and a text mode. From the perspective of OpenOS (or any other runtime and architecture), the stream handle was opened in a specified mode, so the caller should be able to safely assume the stream is in text or binary mode. It is therefore my opinion that OpenOS is not responsible for verifying that the stream behaves in a textual or binary manner.

If we are not respecting mode in the Java layer for text mode (and we might not be), then we'll need to use an InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8) where we can. I'll reopen this ticket until I have verified what we're currently doing.

But, in the end, I'm not handling this in the OpenOS layer.

payonel commented 4 years ago

It only took 5 years :) but yes, I decided to fix this. I found a fast way to check the buffer I/O result for a valid/expected result size. The buffer will return more chars in case a bad sequence is detected, and if the end of the stream is reached, a chunk read will occur. This resolves this use case and was shown to preserve data in testing.
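Roughly, that behavior amounts to reading ahead when the tail of a read is a sliced sequence. A loose sketch, illustrative only and not the actual commit, reusing a boundary check like the completeLength helper sketched earlier in this thread:

    -- Hypothetical byte-level reader interface: readRaw(count) returns up to
    -- `count` raw bytes as a string, or nil at the end of the stream.
    local function readChars(readRaw, n)
      local data = readRaw(n) or ""
      -- a sliced UTF-8 sequence needs at most three more bytes to complete,
      -- so returning "more chars" is a bounded read-ahead
      for _ = 1, 3 do
        if #data == completeLength(data) then break end
        local more = readRaw(1)
        if not more then break end -- end of stream: keep what we have
        data = data .. more
      end
      return data
    end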