ekmett / trifecta

Parser combinators with highlighting, slicing, layout, literate comments, Clang-style diagnostics and the kitchen sink
http://ekmett.github.com/trifecta/

Whither Text? #51

Open bitemyapp opened 8 years ago

bitemyapp commented 8 years ago

Thanks to another IRC user, I was able to get Text parsing with Trifecta via this code:

-- Text Rope and parsing
instance Reducer Text Rope where
  unit = unit . strand . encodeUtf8
  cons = cons . strand . encodeUtf8
  snoc r = snoc r . strand . encodeUtf8

parseText :: Parser a -> Delta -> Text -> Result a
parseText p d inp =
  starve $ feed inp $ stepParser (release d *> p) mempty mempty

But the copying makes me unhappy. I asked in IRC, but no one really knew: why does Trifecta only support UTF-8 ByteStrings as a first-class input stream?

phadej commented 8 years ago

Supporting Text "natively" would also mean supporting UTF-16 bytestrings. I think it may be possible, but I'm not sure how.

bitemyapp commented 8 years ago

@phadej is there something intrinsic to how UTF-16 bytestrings are laid out that would require a large-scale revision of the library, or is it just schlep? Or something in between?

phadej commented 8 years ago

@bitemyapp trifecta works on ByteString. I don't see a problem with making a newtype Parser16 that has different CharParsing and DeltaParsing instances while the rest of the machinery stays the same; i.e. we need to tell it how to get a Char from a ByteString that is encoded differently.
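To make "how to get a Char from a differently encoded ByteString" concrete, here is a minimal sketch of decoding one Char from UTF-16 stored in a strict ByteString. It assumes little-endian code units; `unitAt` and `unconsUtf16LE` are hypothetical names for illustration and are not part of trifecta or of any Parser16 that exists today.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word16)

-- | Read one little-endian UTF-16 code unit at byte offset i, if both bytes exist.
unitAt :: B.ByteString -> Int -> Maybe Word16
unitAt bs i
  | i >= 0 && i + 1 < B.length bs =
      Just (fromIntegral (B.index bs i)
            .|. (fromIntegral (B.index bs (i + 1)) `shiftL` 8))
  | otherwise = Nothing

-- | Decode the Char starting at byte offset i, together with the number of
-- bytes it occupies (2, or 4 for a surrogate pair).
-- Assumption of this sketch: unpaired surrogates are rejected with Nothing.
unconsUtf16LE :: B.ByteString -> Int -> Maybe (Char, Int)
unconsUtf16LE bs i = do
  hi <- unitAt bs i
  if hi < 0xD800 || hi > 0xDFFF
    then Just (chr (fromIntegral hi), 2)          -- single code unit (BMP)
    else if hi <= 0xDBFF
      then do
        lo <- unitAt bs (i + 2)                   -- expect a low surrogate next
        if lo >= 0xDC00 && lo <= 0xDFFF
          then Just (combine hi lo, 4)
          else Nothing
      else Nothing                                -- stray low surrogate
  where
    -- Combine a high/low surrogate pair into one code point.
    combine :: Word16 -> Word16 -> Char
    combine h l =
      chr (0x10000
           + ((fromIntegral h - 0xD800) `shiftL` 10)
           + (fromIntegral l - 0xDC00))
```

A CharParsing instance for such a newtype would presumably call something like this where the existing parser decodes UTF-8, and advance its position by the returned byte count.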

The problem is that ByteString is

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length

but Text is

-- | A space efficient, packed, unboxed Unicode text type.
--
-- Internally, the 'Text' type is represented as an array of 'Word16' UTF-16 code units.
-- The offset and length fields in the constructor are in these units, not units of 'Char'.
data Text = Text
    {-# UNPACK #-} !Array          -- payload (Word16 elements)
    {-# UNPACK #-} !Int              -- offset (units of Word16, not Char)
    {-# UNPACK #-} !Int              -- length (units of Word16, not Char)
    deriving (Typeable)

-- | Immutable array type.
data Array = Array {
      aBA :: ByteArray#
    }

And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without copy.

bitemyapp commented 8 years ago

And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without copy.

This is the essence of my worry: that it would force a larger rewrite to play nice with ByteArray#.

ekmett commented 8 years ago

The real reason was that massive amounts of code duplication would be required. I'm open to switching everything from ByteString to Text, but that turns out to be a bit heavy as well. At the time, Text didn't support the codepoint-counted cut operations we needed to avoid massive asymptotic slowdowns. I managed to get Bryan to add them, but the few months between asking and receiving robbed the rebuild of any steam.