colobot / colobot

Source code of open-source Colobot: Gold Edition project developed by Epsitec and TerranovaTeam
http://colobot.info/
GNU General Public License v3.0
1.13k stars 192 forks source link

message doesn't display the numeric value of type byte, but the character representation (game crashes sometimes) #1446

Closed a-v-o-dev closed 3 months ago

a-v-o-dev commented 3 years ago

Test case:

byte a = -1;
message (a);
byte b = 256;
message (b);
byte c = a - 1;
message (c);
byte d = b + 1;
message (d);

There is no compiler or runtime error. message just shows strange characters.

This is probably also true for the other integer types, but it's not so obvious because numbers outside the range don't occur so often. Assigning a number outside the valid range should be detected already at compile time. Leaving the range in a calculation should produce a runtime error.

tomaszkax86 commented 3 years ago

Assigning big values to small variables has a well-defined behavior and is not considered an error. The values are simply truncated to fit in the variable. It is programmer's responsibility to make sure numeric precision is not lost due to conversion.

a-v-o-dev commented 3 years ago

a) This game is also thought for teaching programming ("It has been adapted for the specific purposes of COLOBOT and for an efficient pedagogical approach.", SatCom: Programming - the CBOT language), isn't it? Is "It is programmer's responsibility" an efficient pedagogical approach? One reason that (some) programming languages (e.g. free pascal) have introduced types is to give the compiler or the runtime engine the chance to do range checks and show approbiate error messages.

for (byte i = 0; i < 300; i++)
{
}

Do you see the error? A student who doesn't see the error starts the program and waits and waits ... and wonders why the program doesn't end. How can this student find out what's going wrong with his program? With the implemented debugging features he has to click over 1500 times (min. 6 times per loop) onto "1" ("One step") until he might realize that i changes from 255 to 0. And if he doesn't realize it on the first run, he has to do again over 1500 clicks. And this is only byte; how many clicks would be needed for short?

It would need a debugging feature like a conditional breakpoint to make it the "programmer's responsibility".

b) Some more tests with the other integer types showed, that the problem seems to be how the message instruction handles byte:

long a = 64;
message (a);  // displays 64
int b = 64;
message (b);  // displays 64
short c = 64;
message (c);  // displays 64
byte d = 64;
message (d);  // displays @
d = 65;
message (d);  // displays A

byte seems to be treated as C++ char.

I tried the example loop from a) with a message instruction:

for (byte i = 0; i < 300; i++)
{
    message (i);
}

The game crashed with this error message:

Unhandled exception occurred!
==============================
Type: std::invalid_argument
Message: Unexpected UTF-8 continuation byte
==============================

BTW: Are all integer types in CBOT meant to be signed integers?

melex750 commented 3 years ago

The scenario you describe in "a)" could be mitigated with in-game documentation for byte type.

byte seems to be treated as C++ char.

Yes, byte is currently using signed char as its underlying type.

BTW: Are all integer types in CBOT meant to be signed integers?

char type in CBOT is using 32-bit unsigned integer.

Handling for that exception with try...catch should probably be implemented in Ui::CDisplayText or something.

tomaszkax86 commented 3 years ago

When you're using byte you're making a decision to use a type that has only 8 bits of storage and is, therefore, limited to range of numbers between 0 and 255. If you don't know what byte is or what valid usage for byte is, you shouldn't use it. The game always had much safer int and float and that's what new CBOT programmers are introduced to. Other types were added quite recently and only for advanced programmers who know what they're doing. If you see byte and use it without learning anything about it, it's your fault when your program fails.

Perhaps we can introduce compile-time warnings about narrowing conversions but we can't treat them as errors. And we can't catch integer overflow at runtime, it would be not only bad for performance but also bad for programmers who know what they're doing and want to use overflow as they'd normally do in other languages.

As for b), it's clearly a bug in message() and it needs to be fixed.

a-v-o-dev commented 3 years ago

Is the range of byte in CBOT really meant to be 0 to 255 (I assumed that at first, too), or is it signed like the other integers (short, int and long), so -128 to 127? It seems to be implemented as signed char. I can't really check, because the game crashes even without the message instruction when 128 is assigned to a byte variable and I try to run it step by step. Probably the same method is used for message and to display the contents of the variables while debugging ("One step"). Edit: Well, now I've tested an empty loop:

for (byte i = 0; i < 150; i++) {}

Who would expect this to be an endless loop? It is.

So byte is in CBOT implemented as a signed integer with range -128 to 127. So the two of us, you and I, didn't know what byte is, at least what it is in CBOT.

Using overflow with unsigned integers might often be used, but an internet search with "signed integer overflow" shows pages that say: Signed integer overflow is undefined behavior in C language. As it's undefined behavior, no programmer can know what he's doing, if he uses it.

melex750 commented 3 years ago

You can use all 8 bits any way you like, even to store more than one value at the same time. I should mention the undocumented >>> unsigned right shift operator was recently fixed. if (your bool be[8];) { you byte; }

melex750 commented 3 years ago

On the subject of undefined behavior, If byte were unsigned the for loops above would be considered as using undefined behavior with signed/unsigned comparison, but this isn't relevant for CBOT or how a byte is normally used.

a-v-o-dev commented 3 years ago

I have no problem with an integer type with the range -128 to 127. This can be documented, so that everyone knows. Same with that there is no range check.

My proposal at this point of understanding: Better not name this type byte. byte is commonly known and in some other languages implemented as unsigned integer 0-255. byte is not yet official and therefore a change can easily be done at this point of time. With a different name, CBOT is open to implement unsigned integers at some other point of time. This would be hard to do (many CBOT programs would have to be rewritten) after byte is introduced as name for this signed integer type. A student who just learned the difference between bit, byte, word etc. wouldn't be confused about the different range. Even advanced users (who know what a byte is, so that they think there's no need for them to read the documentation) wouldn't be surprised.

In SQL there is the integer type tinyint for signed integers with the range -128 to 127. Like short as a short form of ShortInt, this type could be named tiny. Someone who knows SQL would guess right, others who don't know SQL can't assume anything, but would be invited to read the documentation about this type.

Maybe there is a better name than tiny. C doesn't offer a name other than char, which is already used in CBOT. In Freepascal it is named Int8 or ShortInt, but short is already used in CBOT. Java doesn't know unsigned integers, so there byte is defined like in CBOT now. In Visual Basic it is named SByte.

Introducing byte as name for this type might be the decision that there might never be unsigned integers in this language. The argument would always be "backward compatibility".

melex750 commented 3 years ago

A programmer is provided with a group operators and containers for working with bytes and bits, if they're working at level of bytes and bits they don't have to care if it is signed or unsigned, the meaning each byte or bit is assigned by them. If they want to interpret a byte as signed or unsigned they can.

a-v-o-dev commented 3 years ago

Experts can do with bits and bytes and unsigned and signed whatever they want, that's true, but that is not what I'm talking about. At first I wanted to understand if it was really meant to implement a signed 8-bit type and as far as I realized all CBOT integers are meant to be signed, so this implementation is consequent. So, for me this question is answered. At second I suggested to reduce confusion. Reminder: "and for an efficient pedagogical approach." (from my second post, respective from SatCom) Therefore my view is not what a programmer can do or not, but what introduces or minimizes confusion. At third I was thinking at the further development of the language and wanted to show that a future opportunity might get lost and that this could be avoided at this point of time.

melex750 commented 3 years ago

I don't think anyone knows if byte was meant to be signed or unsigned, and there can't be much confusion after discovering that signed and unsigned bytes are exactly the same thing, and that most single bytes of data don't represent any specific number or symbol. And it really doesn't change how a byte is used, whether it's called signed or unsigned, or how one byte can represent the values of many more bytes. If the data fits in a byte we can use any byte to store it.

A byte variable is almost always used as a container of bits, often just to store or move partial, arbitrary information as part of a group of bytes where one byte has no meaning by itself.

Any byte type can act as a little integer for doing math or iterating a range of values, but it usually doesn't matter what values or symbols a single byte represents because the value, symbol, or group of booleans can be chosen arbitrarily by the person using that single byte. A single byte or even a few bits can be used to help compress many more bytes of data, as done with LZSS and LEB128.

melex750 commented 3 years ago

Maybe message() and variable list should show the binary representation, like 0b01001011, if the argument given is byte.

a-v-o-dev commented 3 years ago

Just to understand:

int a = 65535; // bitmask: 0000 0000 0000 0000 1111 1111 1111 1111
message (a);   // display: 65535
short b = a;   // bitmask:                     1111 1111 1111 1111  (bitmask preserved)
message (b);   // display: -1                                       (sign and value changed)
int c = b;     // bitmask: 1111 1111 1111 1111 1111 1111 1111 1111  (bitmask changed)
message (c);   // display: -1                                       (sign and value preserved)

In one direction (large to small) the bitmask is preserved (only high bytes are omitted), sign and value might change. In the opposite direction (small to large) the bitmask might change (expanding the sign bit), sign and value are preserved. Is this the intended behavior? (I used short in this example, because of the problem with message and byte.)

If I understand right, then you see a byte as a bit container, not as a number like the other integer types. The other integer types can be used as bit containers, too, and a byte can also represent a number, so where's the difference?

I wouldn't handle byte different from the other integers, so I'd show the numerical value (-128 .. 127). On the other hand a function to convert any of the integer types into a string that shows the bitmask could be useful, not only for byte.

melex750 commented 3 years ago

It should be pointed out that there's currently a lack of unit tests confirming correctness of various conversions and operations.

It looks like it's working correctly and using Two's Complement.

CBOT code for converting integers to string:

extern void Test()
{
    byte i = 0b100100; //= 36 = 0x24 = 0b00100100 = '$' = '\u0024';
    message(toBits(i));
    message(toHex(i));
}

string toBits(byte i)
{   // change 'byte' to any integer type
    short nbits = sizeof(byte) * 8;
    string str = "";
    for (short s = 0; s < nbits; s++)
    {
        if (0 == s % 4 && s > 3) str = '_' + str;
        str = (i & 0x01) + str; // get the low bit
        i >>>= 1; // move all bits right by 1
    }
    return "b" + str;
}

string toHex(byte i)
{   // change 'byte' to any integer type
    short bits = sizeof(byte) * 8;
    string str = "";
    for (short s = 0; s < bits; s += 4)
    {
        if (0 == s % 16 && s > 15) str = "_" + str;
        char val = (i & 0x0F); // get the low 4 bits
        val += (val < '\u000A') ? '0': 'A' - '\u000A';
        str = val + str;
        i >>>= 4; // move all bits right by 4
    }
    return "x" + str;
}
Anton-V-K commented 2 years ago

Displaying bytes (implicitly converted to chars?) still crashes colobot on dev branch when loop reaches -128, which is treated as character (Euro sign). The test code:

for (byte i = 0; i < 300; i++)
{
    message (i);
}

The call stack:

colobot.exe!_CxxThrowException(void * pExceptionObject, const _s__ThrowInfo * pThrowInfo) Line 75   C++
colobot.exe!StrUtils::Utf8CharSizeAt(const std::string & str, unsigned int pos) Line 195    C++
colobot.exe!Gfx::CText::StringToUTFCharList(const std::string & text, std::vector<Gfx::UTF8Char,std::allocator<Gfx::UTF8Char>> & chars) Line 1063   C++
colobot.exe!Gfx::CText::DrawString(const std::string & text, Gfx::FontType font, float size, Math::IntPoint pos, int width, int eol, Gfx::Color color) Line 1134    C++
colobot.exe!Gfx::CText::DrawText(const std::string & text, Gfx::FontType font, float size, Math::Point pos, float width, Gfx::TextAlign align, int eol, Gfx::Color color) Line 517  C++
colobot.exe!Ui::CLabel::Draw() Line 81  C++
colobot.exe!Ui::CWindow::Draw() Line 887    C++
...

The exception is thrown from StrUtils::Utf8CharSizeAt:

...
    // Invalid char - unexpected continuation byte
    if (isUtf8ContinuationByte(c))
        throw std::invalid_argument("Unexpected UTF-8 continuation byte");
...
hexagonrecursion commented 3 months ago

A programmer is provided with a group operators and containers for working with bytes and bits, if they're working at level of bytes and bits they don't have to care if it is signed or unsigned, the meaning each byte or bit is assigned by them. If they want to interpret a byte as signed or unsigned they can.

Incorrect. CBot type byte behaves as a signed integer (essentially just like int but with a smaller range) when used in integer mathematical operations:

extern void New()
{
    byte a = 255; // becomes -1
    byte b = a / 2;
    int i = b;
    message(i); // prints 0
}
hexagonrecursion commented 3 months ago

The crash was fixed by https://github.com/colobot/colobot/commit/9c53d4dc0a8f6287763976d7b93e179a5fd41909 (only available on the dev branch so far)

melex750 commented 3 months ago

...when used in integer mathematical operations...

It wasn't a comment about integer arithmetic using byte. It was a comment about the fact that you can interpret a byte as unsigned:

    byte b = 0xFF; // -1

    int i = b & 0b11111111;

    message(""+i); // 255