algorand / go-algorand

Algorand's official implementation in Go.
https://developer.algorand.org/
Other
1.35k stars 473 forks source link

Msgpack bytestring/text encoding #1968

Open georgeee opened 3 years ago

georgeee commented 3 years ago

Subject of the issue

Hi! I attached msgpack encoding of block 9414973 from betanet (retrieved via url). It doesn't meet msgpack specification.

It encodes some address field of a transaction as string. However specification of messagepack clearly says that this encoding is for UTF-8 strings, not bytestring.

Attachments

file.zip

Your environment

N/A

Steps to reproduce

Try parsing attached file with some msgpack parser (e.g. this Haskell library).

Expected behaviour

File is parsed.

Actual behaviour

File is not parsed due to wrongly encoded bytestring field.

georgeee commented 3 years ago

Here is the source of the bug: https://github.com/algorand/go-algorand/blob/3fce706479573bcee6c5cadd5465cd32e7175328/data/basics/teal.go#L46

ian-algorand commented 3 years ago

Need to do some initial investigation on this before we jump into implementing a solution.

kirelagin commented 3 years ago

Let me chime in to, hopefully, clarify what is happening, at least the way I understand it.

In Go there are two, essentially, equivalent types: string and []byte – the only difference between them is that the former is immutable. That is in contrast to other modern programming languages where “string” means a sequence of Unicode codepoints rather than bytes.

I don’t have a lot of experience with Go, but, I think, there is no universal agreement that e.g. string should only ever be used for valid UTF-8 encodings, although that kind of seems to be the intent of the creators of the language.

Now, in MessagePack, there are two types: String (and its corresponding format str) and Binary (and its corresponding format bin). In line with the interpretation of higher-level languages, the MessagePack specification requires String to be a valid UTF-8 encoding of a Unicode string.

The final component of the issue is the go-codec library, which, by default, maps Go string to MessagePack str (and Go []byte to MessagePack bin). The first part of this mapping is incorrect in general, since in Go string can contain arbitrary bytes, but in MessagePack str cannot.

So, there are two ways to fix this situation:

  1. Make sure that all structs use []byte instead of string at all places where arbitrary bytes, rather than correct UTF-8-encoded codepoints, can be found.
  2. Keep using string, but set StringToRaw = true in the codec configuration (however, if there are places in the protocol where genuine strings are used, they will be serialised to bin now; not a big deal and, I believe, the only reason to prefer str over bin where possible is that its serialisation is more compact).

Either way, current serialisation procedure is, unfortunately, not compliant with the specification and will definitely cause problems for standard-compliant decoders in languages that have a “real” string type. E.g. the transaction linked in the first comment cannot be decoded by py-algorand-sdk, since Python’s msgpack will fail to decode arbitrary bytes as a string:

>>> req = urllib.request.Request('https://api.betanet.algoexplorer.io/v2/blocks/9414973?format=msgpack', headers={'User-Agent': 'Not Python'})
>>> msgpack.unpack(urllib.request.urlopen(req))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 129, in unpackb
    ret = unpacker._unpack()
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 671, in _unpack
    ret[key] = self._unpack(EX_CONSTRUCT)
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 671, in _unpack
    ret[key] = self._unpack(EX_CONSTRUCT)
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 644, in _unpack
    ret.append(self._unpack(EX_CONSTRUCT))
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 671, in _unpack
    ret[key] = self._unpack(EX_CONSTRUCT)
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 671, in _unpack
    ret[key] = self._unpack(EX_CONSTRUCT)
  File "/home/kirelagin/proj/serokell/py-algorand-sdk/venv/lib/python3.8/site-packages/msgpack/fallback.py", line 666, in _unpack
    raise ValueError(
ValueError: <class 'int'> is not allowed for map key

And either way, fixing this means breaking the wire format compatibility for Algorand.

kirelagin commented 3 years ago

(This might actually be worth filing an issue in go-crypt to change StringToRaw to true by default, because this way people will, at least, be aware of the caveat and it will not produce potentially incorrect encodings with default settings.)