ap / Bencode

BitTorrent serialisation format
https://metacpan.org/release/Bencode

Need clear behavior for strings with utf8 flag set #2

Open dolmen opened 11 years ago

dolmen commented 11 years ago

The base bencoding format specifies encoding only for strings of bytes, not Unicode strings. When decoding, there is no way to tell whether the original data was a UTF-8 string or a byte buffer that merely looks like a UTF-8 string.

The module documentation should clearly specify how Perl strings with the utf8 flag set are handled when passed to bencode, and the implementation should be tested for that behavior. Throwing an exception would be appropriate, since it forces users of the module to encode their data as bytes themselves.

The bdecode function should clearly disallow a string of characters and allow only a string of bytes.
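A minimal sketch of the strictness proposed here, using only core modules. `assert_byte_string` is a hypothetical helper, not part of the Bencode API, and testing for characters above 0xFF (rather than inspecting the utf8 flag) is an assumption about how such a check might be written:

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode);

# Hypothetical guard, not part of Bencode: croak if the string contains
# any character above 0xFF, since such a string cannot be a byte string.
sub assert_byte_string {
    my ($str) = @_;
    croak 'Wide character in string; encode it to bytes first'
        if $str =~ /[^\x00-\xff]/;
    return $str;
}

my $text  = "\x{263A} smile";          # character string with a wide character
my $bytes = encode('UTF-8', $text);    # caller chooses the encoding explicitly

# The encoded bytes pass the guard; the raw character string does not.
my $ok       = eval { assert_byte_string($bytes); 1 };
my $rejected = !eval { assert_byte_string($text); 1 };
```

With a guard like this in front of the encoder or decoder, the choice of encoding stays with the caller instead of being guessed by the module.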

ap commented 11 years ago

The UTF8 flag is irrelevant, and whether a string holds bytes or characters cannot be known except in one particular case: if the string matches /[^\x0-\xff]/, it contains some wide characters, so it cannot be a (proper) byte string. (It may still contain bytes mixed in with the characters if the code that constructed it is buggy.) A string that does not match could be either, regardless of whether its UTF8 flag is set.
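To illustrate the point with core Perl only (a sketch; `has_wide_chars` is a hypothetical helper name, not something Bencode provides): two strings can hold identical characters while differing in their UTF8 flag, and only the wide-character test separates a string that definitely is not bytes from the rest.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string holds any character above 0xFF.
sub has_wide_chars { return $_[0] =~ /[^\x00-\xff]/ ? 1 : 0 }

my $latin    = "caf\xe9";     # four characters, all <= 0xFF, flag off
my $upgraded = $latin;
utf8::upgrade($upgraded);     # same characters, UTF8 flag now on

my $wide = "\x{263A}";        # one character above 0xFF

# The flag differs between the first two strings, yet they compare equal
# and neither contains a wide character.
my $flags_differ = (utf8::is_utf8($latin) ? 1 : 0) != (utf8::is_utf8($upgraded) ? 1 : 0);
my $same_content = $latin eq $upgraded;

printf "latin:    wide=%d\n", has_wide_chars($latin);
printf "upgraded: wide=%d\n", has_wide_chars($upgraded);
printf "wide:     wide=%d\n", has_wide_chars($wide);
```

The flag only records how Perl stores the string internally, which is why it says nothing about whether the contents were meant as bytes or as text.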

In any case, you are right: the docs should state how the module handles this issue.