Open bitemyapp opened 8 years ago
Supporting Text "natively" would also mean supporting UTF-16 bytestrings. I see it may be possible, but I'm not sure how.
@phadej is there something intrinsic to how UTF-16 bytestrings are laid out that would require a large-scale revision of the library, or is it just schlep? Or something in between?
@bitemyapp trifecta works on ByteString; I don't see a problem with making a newtype Parser16 which would have different CharParsing and DeltaParsing instances, but otherwise the machinery could stay the same; i.e. we need to tell it how to get a Char from a ByteString which is encoded differently.
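To make "how to get a Char from a ByteString which is encoded differently" concrete, here is a minimal sketch of decoding one Char from UTF-16 bytes. It is not trifecta code: the names unit16 and unconsUtf16LE are hypothetical, and little-endian byte order is assumed. A Parser16 along the lines suggested above would need something of this shape underneath its CharParsing instance.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word16)

-- | Read one little-endian 16-bit code unit from the front of the input.
-- (Hypothetical helper; byte order is an assumption.)
unit16 :: B.ByteString -> Maybe (Word16, B.ByteString)
unit16 bs
  | B.length bs < 2 = Nothing
  | otherwise =
      let lo = fromIntegral (B.index bs 0) :: Word16
          hi = fromIntegral (B.index bs 1) :: Word16
      in Just ((hi `shiftL` 8) .|. lo, B.drop 2 bs)

-- | Decode one Char from UTF-16LE input.
-- Returns Nothing on truncated input or an unpaired surrogate.
unconsUtf16LE :: B.ByteString -> Maybe (Char, B.ByteString)
unconsUtf16LE bs = do
  (u1, rest1) <- unit16 bs
  if u1 >= 0xD800 && u1 <= 0xDBFF
    then do
      -- High surrogate: a low surrogate must follow.
      (u2, rest2) <- unit16 rest1
      if u2 >= 0xDC00 && u2 <= 0xDFFF
        then
          let hi = fromIntegral u1 - 0xD800 :: Int
              lo = fromIntegral u2 - 0xDC00 :: Int
          in Just (chr (0x10000 + (hi `shiftL` 10) + lo), rest2)
        else Nothing
    else if u1 >= 0xDC00 && u1 <= 0xDFFF
      then Nothing                          -- stray low surrogate
      else Just (chr (fromIntegral u1), rest1)
```

Note that a Char consumes either 2 or 4 bytes here, so any byte-offset bookkeeping (as a DeltaParsing instance would need) has to follow the code-unit count rather than the Char count.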
The problem is that ByteString is
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length
but Text is
-- | A space efficient, packed, unboxed Unicode text type.
--
-- Internally, the 'Text' type is represented as an array of 'Word16' UTF-16 code units.
-- The offset and length fields in the constructor are in these units, not units of 'Char'.
data Text = Text
    {-# UNPACK #-} !Array -- payload (Word16 elements)
    {-# UNPACK #-} !Int   -- offset (units of Word16, not Char)
    {-# UNPACK #-} !Int   -- length (units of Word16, not Char)
    deriving (Typeable)

-- | Immutable array type.
data Array = Array {
      aBA :: ByteArray#
    }
And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without a copy.
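As a rough illustration of this concern (not taken from text or trifecta): an unpinned ByteArray# can be moved by the garbage collector, so a ForeignPtr cannot safely point into it. The bytestring package's own ShortByteString is backed by an unpinned ByteArray#, and its conversion to a strict ByteString is documented as an O(n) copy. The value names below are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Short as SBS

-- Backed by an unpinned ByteArray# on the GHC heap.
shortBytes :: SBS.ShortByteString
shortBytes = SBS.pack [0x68, 0x69]

-- fromShort is O(n): it copies the bytes into pinned memory
-- that a ForeignPtr Word8 can refer to.
strictBytes :: B.ByteString
strictBytes = SBS.fromShort shortBytes
```

A Text payload is in the same position as shortBytes here, which suggests a copy-free route would depend on the array having been allocated pinned in the first place.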
> And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without a copy.
This is the essence of my worry: that it would force a larger rewrite to play nice with ByteArray#.
The real reason was that massive amounts of code duplication would be required. I'm open to switching everything to Text from ByteString, but that turns out to be a bit heavy as well. At the time, Text didn't support the codepoint-counted cut operations we needed to avoid massive asymptotic slowdowns. I managed to get Bryan to add them, but the few months between asking and receiving robbed the rebuild of any steam.
Thanks to another IRC user, I was able to get Text parsing with Trifecta via this code:
But the copying makes me unhappy. I asked in IRC but no one really knew: why does Trifecta only support UTF-8 ByteStrings as a first-class input stream?
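The snippet referred to above is not reproduced in this thread. As a guess at where the copy comes from, one plausible route is to re-encode the Text as UTF-8 and hand the result to trifecta's parseByteString. The sketch below shows only that conversion step; textToParserInput is a hypothetical name, and nothing here is claimed to be the original code.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- | Re-encode Text as a UTF-8 ByteString (hypothetical helper).
-- encodeUtf8 allocates a fresh ByteString, so the whole input is copied.
textToParserInput :: T.Text -> B.ByteString
textToParserInput = encodeUtf8

-- A sample containing one non-ASCII character (U+00E9).
sample :: T.Text
sample = T.pack "h\233llo"
```

For sample, T.length gives 5 characters while the encoded ByteString is 6 bytes long, so any positions reported against the ByteString are UTF-8 byte offsets rather than Char counts in the original Text.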