bflattened / bflat

C# as you know it but with Go-inspired tooling (small, self-contained, and native executables)
GNU Affero General Public License v3.0

On Linux, command line arguments cannot contain multi-byte codepoints because of a limitation in the String type #167

Open lucabol opened 9 months ago

lucabol commented 9 months ago

This is the offending method. Copilot suggests simple UTF-8 -> UTF-16 code, so it may be worth adding to zerolib, but perhaps Copilot is oversimplifying:

        private static unsafe string Ctor(sbyte* ptr)
        {
            sbyte* cur = ptr;
            while (*cur++ != 0) ;

            string result = FastNewString((int)(cur - ptr - 1));
            for (int i = 0; i < cur - ptr - 1; i++)
            {
                if (ptr[i] > 0x7F)
                    Environment.FailFast(null);
                Unsafe.Add(ref result._firstChar, i) = (char)ptr[i];
            }
            return result;
        }

MichalStrehovsky commented 9 months ago

You're running into the FailFast, right? Yeah, zerolib does cut corners like that. This constructor has different behaviors depending on whether this is Windows or Linux (it's "current codepage to UTF-16" on Windows, and "UTF-8 to UTF-16" on Linux).

lucabol commented 9 months ago

Well, it's not 'UTF-8 to UTF-16' on Linux; it is 'ASCII to UTF-16' (by design). That seems too limiting, even for zerolib. But perhaps you feel differently?

MichalStrehovsky commented 9 months ago

I'm not opposed to adding a helper for this. It might not be the only place where UTF-8 to UTF-16 conversion would be useful.
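
For illustration, here is a minimal sketch of what such a UTF-8 to UTF-16 helper might look like. It is not zerolib code: the names `Utf8Sketch` and `Decode` are hypothetical, and it targets the regular .NET runtime (it uses `StringBuilder` and `ReadOnlySpan<byte>`, which zerolib may not provide). A zerolib version would presumably write straight into the buffer returned by `FastNewString` instead. The error policy is also an assumption: every malformed byte becomes U+FFFD rather than triggering a FailFast.

```csharp
using System;
using System.Text;

public static class Utf8Sketch
{
    // Hypothetical helper, not part of zerolib: decodes UTF-8 bytes into a UTF-16 string.
    // Assumed error policy: a bad lead byte, a truncated or invalid continuation,
    // an overlong form, a surrogate code point, or a value above U+10FFFF
    // produces U+FFFD, and decoding resumes at the next byte.
    public static string Decode(ReadOnlySpan<byte> utf8)
    {
        var sb = new StringBuilder(utf8.Length);
        int i = 0;
        while (i < utf8.Length)
        {
            byte b0 = utf8[i];
            int needed; // number of continuation bytes expected
            int cp;     // code point being assembled
            int min;    // smallest code point legal for this length (rejects overlong forms)

            if (b0 < 0x80) { sb.Append((char)b0); i++; continue; }
            else if ((b0 & 0xE0) == 0xC0) { needed = 1; cp = b0 & 0x1F; min = 0x80; }
            else if ((b0 & 0xF0) == 0xE0) { needed = 2; cp = b0 & 0x0F; min = 0x800; }
            else if ((b0 & 0xF8) == 0xF0) { needed = 3; cp = b0 & 0x07; min = 0x10000; }
            else { sb.Append('\uFFFD'); i++; continue; }

            int end = i + 1 + needed;
            bool ok = end <= utf8.Length;
            for (int j = i + 1; ok && j < end; j++)
            {
                byte b = utf8[j];
                if ((b & 0xC0) != 0x80) { ok = false; break; }
                cp = (cp << 6) | (b & 0x3F);
            }
            ok = ok && cp >= min && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
            if (!ok) { sb.Append('\uFFFD'); i++; continue; }

            if (cp < 0x10000)
            {
                sb.Append((char)cp);
            }
            else
            {
                // Code points above the BMP become a UTF-16 surrogate pair.
                cp -= 0x10000;
                sb.Append((char)(0xD800 + (cp >> 10)));
                sb.Append((char)(0xDC00 + (cp & 0x3FF)));
            }
            i = end;
        }
        return sb.ToString();
    }
}
```

With this sketch, `Utf8Sketch.Decode(new byte[] { 0xC3, 0xA9 })` yields the one-character string "é", where the quoted constructor is meant to fail on any non-ASCII byte.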

lucabol commented 9 months ago

I am keeping track of everything that 'feels like' zerolib in a single file. Perhaps worth doing a simple PR, or discussion, when I am finished.

In case you're wondering: I am playing around with exposing UTF-8 command-line arguments in the spirit of utf8everywhere, then moving on to experimenting with a no-allocation programming model (as in MISRA C), and finishing with a linear allocator. At least, this is the rough idea; we'll see.

ghost commented 9 months ago

On Windows, I think you should use the Windows API. On Linux, I think you should use libiconv. Don't write your own encoding converter.