IntersectMBO / plutus

The Plutus language implementation and tools
Apache License 2.0
1.57k stars 480 forks source link

Fix BuiltinByteString literal construction #6505

Open zliu41 opened 1 week ago

zliu41 commented 1 week ago

The IsString instance of Haskell's ByteString converts a String to ByteString by:

ghci> "\127" :: BS.ByteString
"\DEL"
ghci> "\255" :: BS.ByteString
"\255"
ghci> "\256" :: BS.ByteString
"\NUL

On the other hand, the IsString instance of BuiltinByteString uses UTF-8 encoding, which means it is only consistent with ByteString for Chars <= 127:

ghci> "\127" :: BuiltinByteString
"\DEL"
ghci> "\255" :: BuiltinByteString
"\195\191"
ghci> "\256" :: BuiltinByteString
"\196\128"

This is also the case in Plinth, for instance $$(compile [|| "\255" ||]) :: CompiledCode BuiltinByteString results in #c3bf. This is because GHC converts a String into a Literal containing a UTF-8 encoded ByteString. In this particular case, this is what the plugin sees: unpackCStringUtf8# "\195\191".

This is a problem, not only because of the inconsistency between ByteString and BuiltinByteString, but more importantly because it means there is no easy way to construct a BuiltinByteString literal in Plinth, where some bytes are between 128 and 255. Thus, it is better to adopt the ByteString's behavior.

To do so, we'll need to update both the IsString instance of BuiltinByteString, as well as updating the plugin.

Updating the instance is easy, because the plugin does not compile the unfolding of the instance, but instead has special logic to handle it directly. So we can write any Haskell code in the instance, which means we can simply delegate to ByteString's instance.

Regarding the plugin, we'll need to update the special logic that handles BuiltinByteString's IsString instance. There are two places, here and here. The first deals with stringToBuiltinByteString, and the second deals with fromString @BuiltinByteString.

In both places, we currently use stringExprContent to extract the literal ByteString from the CoreExpr. We need to update this extraction logic to detect unpackCStringUtf8# bs. Instead of returning bs, it should return fromString @ByteString . Text.unpack . Text.decodeUtf8' $ bs. Note that the extraction logic for BuiltinString should stay the same.

Property based tests should be added verifying that fromString for ByteString and BuiltinByteString behave consistently, as well as String and BuiltinString.

michaelpj commented 1 week ago
zliu41 commented 5 days ago

Have we considered asking people to create bytestring literals from lists of integers? e.g. [0x01, 0x02]. We could potentially even make this nice by using OverloadedLists.

That's doable in Haskell, but I don't think it's possible in Plinth

colll78 commented 10 hours ago

To be in-line with what all the other smart contract languages expose, Plinth should allow the creation of byte string literals in three ways:

  1. ByteArray: as an array of values ranging from 0 to 255. In Aiken:
    // both are valid ByteArray definitions:
    #[102, 111, 111]
    #[0x66, 0x6f, 0x6f]

    In Plinth we can expose something like this:

    
    import PlutusTx.Prelude qualified as P
    import qualified Data.ByteString as BS
    import Data.Word (Word8)

builtinByteArray :: [Word8] -> BuiltinByteString builtinByteArray = P.toBuiltin . BS.pack

ourByteArray :: BuiltinByteString ourByteArray = builtinByteArray [ 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef, 0xfe, 0xdc, 0xba, 0x98, 0x76, 0x54, 0x32, 0x10, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff, 0x00 ]

2. ByteString: as a UTF-8 encoded byte string. This is the default representation for double-quoted constants in the other sc languages targeting Plutus Core. 
In Aiken:
```rust
"foo"

In Plinth and Plutarch this is the default IsString instance.

ourByteString :: BuiltinByteString 
ourByteString = "foo" 
  1. HexByteString: as a hex-encoded byte string. It is common to manipulate base-16 encoded byte strings in blockchain environments (transaction ids, script hashes, public key hashes, etc), so it makes sense to provide a short-hand for defining BuiltinByteString literals from hexadecimal strings.

In Aiken, this is a double-quoted byte string prefixed with a # like so:

#"666f6f" 

In Plutarch this is:

someHexLiteral :: ClosedTerm PByteString 
someHexLiteral = phexByteStr "666f6f"

In Plinth we can expose something like this:

import Data.Word (Word8)
import Data.Char (toLower)
import PlutusLedgerApi.V2 (BuiltinByteString)
import PlutusTx.Prelude qualified as P
import qualified Data.ByteString as BS

toBuiltinHexString :: String -> BuiltinByteString 
toBuiltinHexString = P.toBuiltin . toHexString

toHexString :: String -> BS.ByteString 
toHexString = 
  BS.pack . f
  where
    f "" = []
    f [_] = error "UnevenLength"
    f (x : y : rest) = (hexDigitToWord8 x * 16 + hexDigitToWord8 y) : f rest

hexDigitToWord8 :: HasCallStack => Char -> Word8
hexDigitToWord8 = f . toLower
  where
    f :: Char -> Word8
    f '0' = 0
    f '1' = 1
    f '2' = 2
    f '3' = 3
    f '4' = 4
    f '5' = 5
    f '6' = 6
    f '7' = 7
    f '8' = 8
    f '9' = 9
    f 'a' = 10
    f 'b' = 11
    f 'c' = 12
    f 'd' = 13
    f 'e' = 14
    f 'f' = 15
    f c = error ("InvalidHexDigit " <> [c])