Adjust corpus call sequence encoding of `string` types so that they are utf-8 compatible

tuturu-tech commented 6 months ago

Currently string types are just byte arrays which don't seem to adhere to any specific encoding, this makes it difficult to correctly decode strings from corpus call sequences in tools such as fuzz-utils. Using a standard encoding such as utf-8 would ensure other tools can correctly process the corpus.

tuturu-tech commented 3 months ago

Example call sequence and contract:

 {
  "call": {
   "from": "0x0000000000000000000000000000000000010000",
   "to": "0xa647ff3c36cfab592509e13860ab8c4f28781a66",
   "nonce": 1,
   "value": "0x0",
   "gasLimit": 12500000,
   "gasPrice": "0x1",
   "gasFeeCap": "0x0",
   "gasTipCap": "0x0",
   "data": "0xe8f0d2db0000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000005f65fc28be74620a5d4449a5b21393f93ac0e9089234a90b704852792227fa270ba3929fe4153840f9063d73e2c0518fd91b0dacd93edd00a47633da9ec576747da9b64ebafae9cbf6a2af2669c11669989c434713f73272573b326f98bb0ceb00",
   "dataAbiValues": {
    "methodSignature": "check_specific_string(string)",
    "inputValues": [
     "e\ufffd(\ufffdtb\n]DI\ufffd\ufffd\u0013\ufffd\ufffd:\ufffd\ufffd\b\ufffd4\ufffd\u000bpHRy\"'\ufffd'\u000b\ufffd\ufffd\ufffd\ufffd\u00158@\ufffd\u0006=s\ufffd\ufffdQ\ufffd\ufffd\u001b\r\ufffd\ufffd\u003e\ufffd\u0000\ufffdv3ڞ\ufffdvt}\ufffd\ufffdN\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u0026i\ufffd\u0016i\ufffd\ufffdCG\u0013\ufffd2rW;2o\ufffd\ufffd\f\ufffd"
    ]
   },
   "AccessList": null,
   "SkipAccountChecks": false
  },
  "blockNumberDelay": 37521,
  "blockTimestampDelay": 390361
 },

pragma solidity ^0.8.0;

// medusa fuzz --compilation-target . --target-contracts "Nonutf" --corpus-dir corpus-utf
contract Nonutf {
    function check_specific_string(string memory provided) public {
        require(bytes(provided).length > 0);
        if (keccak256(bytes(provided)) == keccak256(bytes(unicode"00FC"))) {
            assert(false);
        }
    }
}

If we take the encoded data (ignoring offset and length) and convert it to string via chisel we can see it contains non-utf8 characters:

➜ bytes memory test = hex"65fc28be74620a5d4449a5b21393f93ac0e9089234a90b704852792227fa270ba3929fe4153840f9063d73e2c0518fd91b0dacd93edd00a47633da9ec576747da9b64ebafae9cbf6a2af2669c11669989c434713f73272573b326f98bb0ceb00"
➜ string(test)
Type: string
├ UTF-8: e�(�tb
]DI����:��4�
            pHRy"'�'
�>��v3ڞ�vt}��N�������&i�i��CG�2rW;2o��
                                      �

Since Solidity isn't limited to utf8 I think it's fine to keep non-utf8 encoding, but tools might have a hard time parsing the decoded arguments the way they are currently. Using some encoding format that can be easily converted to unicode might be better.

0xalpharush commented 3 months ago

I think we can just remove the call to QuoteToASCII here.https://github.com/crytic/medusa/blob/6750032502ed64952435dc408be3d8a1a107eb5c/fuzzing/valuegeneration/abi_values.go#L444 It does not seem like the decoding unquotes it https://github.com/crytic/medusa/blob/6750032502ed64952435dc408be3d8a1a107eb5c/fuzzing/valuegeneration/abi_values.go#L792-L797

crytic / medusa

Adjust corpus call sequence encoding of `string` types so that they are utf-8 compatible #339