DCIT / perl-Crypt-JWT

Other
54 stars 18 forks source link

suspect usage of utf8 functions #5

Closed jberger closed 8 years ago

jberger commented 8 years ago

There are several lines like utf8::(de|en)code($thing) if utf8::is_utf8($thing) which is almost certainly not what you want. Specifically is_utf8 does not tell you whether you have characters or bytes unfortunately.

karel-m commented 8 years ago

So what do you suggest? Would it be better to call utf8::encode(..) and/or utf8::decode(..) unconditionally?

jberger commented 8 years ago

decode_json and encode_json should handle the bytes->chars and chars->bytes for you. Is there some reason you feel the need to handle it yourself?

karenetheridge commented 8 years ago

Would it be better to call utf8::encode(..) and/or utf8::decode(..) unconditionally?

That would be better. You could also use some sort of config flag or other out-of-band mechanism to signal if the data should be encoded/decoded -- but the utf8 flag on the scalar does not carry this information.

karel-m commented 8 years ago

decode_json and encode_json should handle the bytes->chars and chars->bytes for you

This is IMO not true, try this code:

use utf8;
use JSON::MaybeXS;
my $json = '{"str":"žluťoučký kůň"}';
#utf8::encode($json);
my $hash = decode_json($json);
print $hash->{str} eq "žluťoučký kůň";

Without calling utf8::encode($json) it dies: Wide character in subroutine entry at test.pl line 5.

karenetheridge commented 8 years ago

decode_json expects bytes, not characters (and will give you characters in the inflated data structure). Your example is not simulating the bytestream that your encryption algorithm would provide.

karel-m commented 8 years ago

OK, so no utf8::decode before decode_json

And what would you expect in this scenario?

  my $payload = 'žluťoučký kůň';
  my $token = encode_jwt(key=>$k, payload=>$payload, alg=>$alg, enc=>$enc);

How should encode_jwt treat $payload containg utf8 string?

jberger commented 8 years ago

If the goal is just to convert characters to bytes (without then encoding to json) I'd probably suggest Encode::encode(utf8 => $payload) but I'm not exactly sure what you are attempting to do.

jberger commented 8 years ago

For example, your example above is better written as

use strict;
use warnings;

use Encode 'encode';
use JSON::MaybeXS;
my $json = encode 'utf8' => '{"str":"žluťoučký kůň"}';
my $hash = decode_json($json);
print $hash->{str} eq "žluťoučký kůň";
karel-m commented 8 years ago

The thing is that if I pass utf8 string with wide characters like $payload = 'žluťoučký kůň' to underlying crypto module, it will croak something like: Wide character in subroutine entry at /.../Crypt/AuthEnc/GCM.pm line 26

The question is whether Crypt::JWT or underlying crypo Crypt::AuthEnc::GCM (BTW also mine) should simply refuse input that is not bytes or whether it should try to do some "magic".

The "encode_jwt" side is a bit simpler as we either have bytes or characters that we can convert to bytes (I used to thought that is_utf8 tells me which case we have).

The "decode_jwt" side is a bit more tricky as we always get bytes and IMO we should simply return bytes.

BTW what is the difference between Encode::encode_utf8($payload) and utf8::encode($payload)?

Grinnz commented 8 years ago

I think you should clearly document where the module expects and returns either characters or bytes (normally, within perl code, one tries to work in decoded characters wherever possible; under "use utf8", your example strings above are decoded characters, but note that JSON is usually represented as UTF-8 encoded bytes).

is_utf8 does not tell you whether you have characters or bytes. It tells you which internal format the string is in, which is only relevant to XS code that uses the underlying C string directly. The only way to "determine" whether a string is characters or bytes is to guess by trying to decode it as UTF8 and seeing if it fails. Being explicit with your input and output is very preferable to this.

karel-m commented 8 years ago

Would it be a good idea to use this?

utf8::downgrade($payload, 1) or croak "JWT: payload cannot contain wide character";
karel-m commented 8 years ago

please have a look at 2e97beb

Grinnz commented 8 years ago

In that case, if payload is expected to be already-encoded bytes, that check should be ok. The test looks correct to me.

For your question on the difference between Encode::encode_utf8 and utf8::encode, they are generally the same as I understand it, and will encode the string using Perl's internal loose UTF-8 format. Encode::encode('UTF-8', $str) is generally more correct as it will use strict UTF-8 encoding, but this also means it can fail. Similar to this is Unicode::UTF8::encode_utf8.

karenetheridge commented 8 years ago

I think your module should always expect a payload of characters, and you do both the utf8- and json-encoding in the module. On the other end, you will return characters as well, doing both the json- and utf8- decoding.

The only time code should be dealing with bytes is when at the outermost layers of the system -- when reading/writing to/from the network or disk.

jberger commented 8 years ago

The only time code should be dealing with bytes is when at the outermost layers of the system -- when reading/writing to/from the network or disk.

Or when building a specific document type, as this module is doing. JWT is analogous to JSON in that it is defined as a document with a specific encoding.

karel-m commented 8 years ago

I think your module should always expect a payload of characters

Well, this is quite hard for a crypto module as the crypto world is more like "bytes in / bytes out".

Analogy (en|de)code_jwt vs. (en|de)code_json makes sense - in that case the last patch is similar to https://metacpan.org/source/MAKAMAKA/JSON-PP-2.27400/lib/JSON/PP.pm#L654

karel-m commented 8 years ago

I have released Crypt-JWT-0.013 with the above mentioned changes.

Thanks for your time.