andyarvanitis / purescript-native-cpp-ffi

C++ foreign export implementations for the standard library
MIT License

an ffi question! #3

Closed iomeone closed 4 years ago

iomeone commented 5 years ago

Hi, sorry to bother you again! I want to use the purescript-unicode package but ran into a bug. I have tried many times and cannot solve it.

My code is:

module Main where

import Prelude hiding (between,when)
import Data.Char.Unicode (digitToInt, isDigit)
import Effect (Effect)
import Effect.Console (logShow)

main :: Effect Unit
main = do
  logShow (isDigit '4')

This ends up calling an FFI function named toCharCode.

So I wrote the FFI:

FOREIGN_BEGIN( Data_Enum )

// foreign import toCharCode :: Char -> Int
exports["toCharCode"] = [](const boxed& c_) -> boxed {
  int  c = unbox <char>(c_);
  return c;
};

FOREIGN_END

But it keeps crashing. I also tried

const string  s = unbox<string>(c_);
const int  i = unbox<int>(c_);

but had no luck; it crashes as before.

Could you please point out what's wrong with the code? Thank you very much!

iomeone commented 5 years ago

I can narrow down the problem! It comes from these definitions in the Data_Bounded FFI:

exports["topChar"] = 0x10FFFF; // unicode limit
exports["bottomChar"] = 0;

So in this case we have to use const int i = unbox< int >(c_);

and in every other case we have to use const string s = unbox< string >(c_);

How can I represent 0, or a three-byte value such as 0x10FFFF, as a Unicode string (the bottomChar and topChar cases)? This looks like another UTF-8 issue...

andyarvanitis commented 5 years ago

Right, it is utf8 related. However, that in particular is a bug in my ffi, because I incorrectly carried over those values from the old implementation. It should be something like:

exports["topChar"] = u8"\U0010FFFF"; // unicode limit
exports["bottomChar"] = u8"\0";

You can try this, but I haven't tested this yet since I can't right now.

In general, the PureScript Char type maps to a C++ string holding exactly one Unicode character (which can be multibyte, per the UTF-8 standard).
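
To illustrate that mapping, here is a minimal sketch of how a one-character UTF-8 string could be turned into its integer code point, which is roughly what toCharCode needs. It uses only std::string and deliberately leaves out the FOREIGN_BEGIN / boxed wrappers. The name firstCodePoint is hypothetical and not part of purescript-native, and the validation is intentionally minimal (overlong encodings and surrogates are not rejected).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not library code): decode the first
// UTF-8 encoded code point stored in `s` and return it as an integer.
inline std::uint32_t firstCodePoint(const std::string& s) {
    if (s.empty()) throw std::invalid_argument("empty string");

    const unsigned char lead = static_cast<unsigned char>(s[0]);
    std::uint32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes that must follow

    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (s.size() < extra + 1) throw std::invalid_argument("truncated sequence");

    // Each continuation byte has the form 10xxxxxx and contributes 6 bits.
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

With this helper, the string "4" yields 0x34, and the four bytes F4 8F BF BF yield 0x10FFFF.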

iomeone commented 5 years ago

The hard part of supporting Unicode is getting a complete character out of a string. For example:

    #pragma execution_character_set("UTF-8")
    juce::String s = CharPointer_UTF8("中文test");
    wcout.imbue(locale("", LC_CTYPE));

    for(int i = 0 ; i < s.length(); i++)
    {
        wcout << wchar_t(s[i]);
        cout << " int value is: " << hex << s[i] << endl;
    }

it will output:

中 int value is: 4e2d
文 int value is: 6587
t int value is: 74
e int value is: 65
s int value is: 73
t int value is: 74

The point of the test is that it automatically detects whether each character occupies one, two, or more bytes!
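
The detection is not specific to JUCE: in UTF-8 the first byte of every character encodes how many bytes the character occupies. The following is a minimal sketch of the same idea with plain std::string. The names utf8SequenceLength and splitCodePoints are hypothetical, and malformed input is handled only by stopping early.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` cannot start a sequence (e.g. a continuation byte).
inline std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Split a UTF-8 string into one std::string per character.
// Stops at the first malformed or truncated sequence.
inline std::vector<std::string> splitCodePoints(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t n = utf8SequenceLength(static_cast<unsigned char>(s[i]));
        if (n == 0 || i + n > s.size()) break;
        out.push_back(s.substr(i, n));
        i += n;
    }
    return out;
}
```

Applied to the UTF-8 bytes of "中文test", this gives six pieces: two of three bytes each, followed by four single-byte pieces.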

I use the JUCE library and replaced every std::string with juce::String, which gives the right behaviour. For example, exports["topChar"] can be written as

exports["topChar"] = juce::String(CharPointer_UTF8("\xF4\x8F\xBF\xBF"));
// Unicode limit: the UTF-8 bytes \xF4\x8F\xBF\xBF encode the code point U+10FFFF (UTF-8 needs 4 bytes for it).
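
The byte sequence in that comment can be checked with a small encoder. This is only a sketch under stated assumptions: encodeUtf8 is a hypothetical name, it does not reject surrogate code points, and it returns an empty string for values above U+10FFFF.

```cpp
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-8.
inline std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;  // empty if cp is out of range
}
```

Encoding 0x10FFFF produces the four bytes F4 8F BF BF, and encoding 0 produces a string of length one containing a single NUL byte. Note that a std::string built from a plain const char* pointing at a NUL byte would be empty, so the length has to be passed explicitly when a NUL character is wanted.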

toCharCode can then simply return s[0], which is convenient:

// foreign import toCharCode :: Char -> Int
exports["toCharCode"] = [](const boxed& c_) -> boxed {
    const juce::String& s = unbox<juce::String>(c_);
    // A Char must hold exactly one character; check before indexing.
    assert(s.length() == 1);
    const int charcode = s[0];
    return charcode;
};

I am still looking forward to an official implementation (yours) of Unicode support!

andyarvanitis commented 4 years ago

The changes for

exports["topChar"] = u8"\U0010FFFF"; // unicode limit
exports["bottomChar"] = u8"\0";

went in a while back, so not sure why I didn't close this then.